Job Overview:
We are seeking a talented and experienced Site Reliability Engineer (SRE) to help scale, support, and improve the reliability and performance of our infrastructure and services. You’ll collaborate with development and operations teams to build and maintain robust, scalable systems, automate routine tasks, and ensure our applications meet performance and uptime goals.
Key Responsibilities:
• Design, build, and maintain reliable infrastructure and tools to support application development and deployment.
• Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and track services against them.
• Implement monitoring, alerting, and incident response systems.
• Automate manual processes, such as deployments, testing, and configuration management.
• Troubleshoot infrastructure and application issues across the stack.
• Participate in the on-call rotation and perform root cause analysis for production incidents.
• Collaborate with development teams to ensure new features meet reliability and scalability goals.
• Improve CI/CD pipelines and deployment strategies.
Qualifications:
• Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
• 3+ years of experience in systems engineering, DevOps, or site reliability engineering.
• Proficiency with cloud platforms (e.g., AWS, GCP, Azure).
• Strong experience with containerization and orchestration (e.g., Docker, Kubernetes).
• Familiarity with monitoring tools (e.g., Prometheus, Grafana, Datadog).
• Experience with infrastructure-as-code (e.g., Terraform, CloudFormation).
• Strong scripting and programming skills (e.g., Python, Bash, Go).
• Solid understanding of Linux/Unix systems, networking, and security principles.
Preferred Qualifications:
• Experience with highly available distributed systems.
• Knowledge of databases (SQL and NoSQL).
• Experience handling incidents and conducting postmortem analysis.