Cloudologic is a prominent cloud consulting and IT service provider based in Singapore and rooted in India, focusing on cloud operations, cyber security, and managed services. With a decade of expertise, our dedication to delivering high-quality services has earned the trust of clients worldwide, making us a valued partner in the tech industry.
Role Description
This is a full-time onsite role for a Senior Site Reliability Engineer at Cloudologic. The SRE Specialist will be responsible for troubleshooting, software development, system administration, and infrastructure maintenance. While the role is based in Gurgaon, remote work is acceptable.
System Reliability & Performance
Ensure high availability, reliability, and scalability of services.
Implement SLOs (Service Level Objectives) and SLIs (Service Level Indicators).
Monitor system performance and proactively address bottlenecks.
Incident Management & Troubleshooting
Respond to incidents, conduct root cause analysis (RCA), and implement fixes.
Develop and improve monitoring, alerting, and diagnostic tools.
Conduct blameless postmortems to improve system resilience.
Automation & Infrastructure as Code (IaC)
Automate deployments, scaling, and recovery processes.
Manage infrastructure using tools like Terraform, Ansible, or Kubernetes.
Implement CI/CD pipelines for seamless software delivery.
Monitoring
Use monitoring tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK) to track system health.
Define and maintain dashboards and alerts for proactive system monitoring.
Security & Compliance
Implement security best practices, vulnerability scanning, and patch management.
Ensure compliance with regulatory requirements (GDPR, ISO 27001, etc.).
Conduct security audits and risk assessments.
Capacity Planning & Cost Optimization
Forecast system demands and scale infrastructure accordingly.
Optimize cloud costs by managing resource utilization efficiently.
Work with development teams to build cost-effective solutions.
Collaboration & Documentation
Work closely with developers, DevOps, and IT teams to improve system reliability.
Document processes, best practices, and incident response playbooks.
Participate in on-call rotations and knowledge-sharing sessions.
Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
5+ years of experience in a Site Reliability Engineering, DevOps, or similar role.
Strong understanding of system reliability, performance, and scalability principles.
Proficiency in scripting languages (e.g., Python, Bash) and automation tools.
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible, Kubernetes).
Expertise in monitoring and logging tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK).
Solid understanding of cloud platforms (AWS, Azure, GCP).
Experience with CI/CD pipelines and software release management.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration skills.
Knowledge of security best practices and compliance requirements.
Preferred Qualifications
Experience with containerization and orchestration technologies (Docker, Kubernetes).
Experience with database administration and optimization.
Relevant certifications (e.g., AWS Certified DevOps Engineer, Google Cloud Certified Professional Cloud DevOps Engineer).
(ref:hirist.tech)