HighLevel

Site Reliability Engineer

Delhi, IN

28 days ago

Summary

We are looking for a Site Reliability Engineer to join our team and help ensure the availability, performance, and scalability of our critical systems. You will work closely with development and operations teams to automate processes, enhance system reliability, and improve observability.

Requirements

  • Experience: 4+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Cloud Expertise: Hands-on experience with GCP and AWS
  • Infrastructure as Code (IaC): Terraform, Helm, or equivalent tools
  • Containerisation & Orchestration: Docker, Kubernetes (GKE)
  • Observability: Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools
  • Programming/Scripting: Proficiency in Python or shell scripting (Bash). Basic understanding of parsing API responses and manipulating JSON
  • CI/CD Pipelines: Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar tools
  • Incident Management: Experience with on-call rotations, SLOs, SLIs, SLAs, escalation policies, and incident resolution
  • Databases: Experience monitoring MongoDB, Redis, Elasticsearch (ES), and queue-based systems

Responsibilities

  • Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.)
  • Optimise system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues
  • Collaborate with developers to enhance application reliability, scalability, and performance
  • Drive cost optimisation efforts in cloud environments
  • Monitor multiple databases and data stores (MongoDB, Redis, Elasticsearch, queue-based systems, etc.)

Skills: Prometheus, Grafana, ELK, Kubernetes, Terraform, Docker, Amazon Web Services (AWS), Google Cloud Platform (GCP), Python, Bash and Shell Scripting
