Job Description: Site Reliability Engineer (SRE) (AWS + Kubernetes + Python)
Start: ASAP
Duration: Long-term, full-time
Summary:
We are seeking a Site Reliability Engineer (SRE) with expertise in AWS, Kubernetes, and Python to ensure the reliability, scalability, and performance of mission-critical applications. The ideal candidate will focus on automation, observability, and incident response, working closely with development and operations teams to improve system reliability and efficiency.
Compensation:
• Full-time (W2): $130K–$160K/year + benefits
• Contract: $100–$110/hour
Responsibilities:
• Build and maintain scalable, highly available infrastructure on AWS
• Automate infrastructure provisioning using Terraform, Ansible, or CloudFormation
• Monitor system performance and troubleshoot incidents using Prometheus, Grafana, and ELK Stack
• Optimize Kubernetes clusters (EKS, GKE, AKS) for reliability and performance
• Develop and maintain CI/CD pipelines for seamless deployments
• Implement disaster recovery, failover strategies, and high availability solutions
• Ensure observability, logging, and tracing across distributed systems
• Collaborate with developers to design self-healing and fault-tolerant architectures
• Conduct post-mortems and root cause analysis for production incidents
Qualifications:
• 7+ years of experience in SRE, DevOps, or cloud infrastructure engineering
• Strong knowledge of AWS services (EC2, Lambda, S3, RDS, IAM, VPC, etc.)
• Experience with Kubernetes, Helm, and container orchestration
• Proficiency in Python, Bash, or Go for automation
• Familiarity with monitoring and logging tools (Datadog, Prometheus, New Relic)
• Experience implementing scaling, load balancing, and failover strategies
• Strong problem-solving skills and the ability to work in a fast-paced environment
• Knowledge of security best practices, IAM, and cloud compliance