Define and implement SRE strategies aligned with business and technology objectives.
Act as a trusted advisor to executive leadership, influencing reliability, observability, and automation initiatives.
Collaborate with engineering, cloud, DevOps, security, and platform teams to drive reliability and resilience roadmaps.
Conduct reliability assessments, risk analysis, and gap identification for continuous service improvement.
Lead the adoption of SRE culture across the organization, evangelizing reliability engineering principles.
Site Reliability & Observability Architecture:
Architect and implement scalable observability solutions including APM, logs, traces, and metrics (e.g., Prometheus, Grafana, Datadog, New Relic, Splunk).
Develop a unified monitoring and alerting framework that integrates real-time insights and automated response mechanisms.
Establish and refine SLOs, SLIs, and error budgets to enhance service reliability.
Optimize incident management and root cause analysis using AI-driven observability and predictive analytics.
Incident & Service Management:
Define and implement best practices for incident response, post-mortems, and problem resolution.
Reduce MTTR (Mean Time to Repair) and increase MTTF (Mean Time to Failure) through proactive automation and analytics-driven insights.
Develop robust escalation and alerting policies that cut alert noise and improve the signal-to-noise ratio in monitoring.
Drive RPA (Robotic Process Automation) and workflow automation to eliminate repetitive, manual operational tasks.
Toil Elimination & Automation:
Identify and eliminate operational toil using self-healing infrastructure, runbook automation, and auto-remediation workflows.
Champion AI/ML-based predictive analytics for anomaly detection, capacity planning, and proactive incident prevention.
Develop CI/CD-driven operational automation to reduce manual intervention in deployments and rollbacks.
Build and lead initiatives in AIOps, ChatOps, and ITSM automation to streamline support operations.
Talent Development & Technical Leadership:
Mentor, coach, and grow high-performing SRE teams, fostering a culture of innovation and continuous learning.
Drive SRE training programs, workshops, and certifications to upskill engineers on modern reliability practices.
Establish and promote career development frameworks for SREs at different levels.
Cultivate an environment of psychological safety, collaboration, and shared responsibility for reliability.
Governance, Compliance & Cost Optimization:
Ensure governance and compliance with regulatory requirements and industry standards and frameworks (e.g., ISO 27001, SOC 2, NIST, ITIL).
Optimize cloud cost efficiency through effective capacity planning, autoscaling, and FinOps principles.
Define policies for resilience engineering, chaos engineering experiments, and disaster recovery planning.
Work closely with InfoSec teams to implement security monitoring and threat detection capabilities.
Required Qualifications & Experience
Technical Expertise:
15+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Expertise in monitoring & observability tools like Datadog, Prometheus, Grafana, New Relic, AppDynamics, Splunk, OpenTelemetry.
Hands-on experience with incident management, service management (ITIL), and automation platforms.
Strong background in toil elimination and workflow automation using RPA, AIOps, and event-driven automation.
Proficiency in programming (Python, Go, Java, or similar) for scripting and automation.
Experience with Kubernetes, Service Mesh (Istio, Linkerd), and cloud-native architectures.
Deep understanding of SLOs, SLIs, error budgets, and reliability engineering principles.
Expertise in Cloud Platforms (AWS, Azure, GCP), including serverless, containerization, and networking.
Strong understanding of predictive analytics, AI/ML for anomaly detection, and self-healing systems.
Leadership & Consulting Skills:
Proven experience leading large-scale SRE teams and driving enterprise-wide reliability initiatives.
Ability to influence executive stakeholders and drive strategic decision-making.
Experience mentoring, coaching, and developing engineering talent.
Exceptional problem-solving and incident management skills with a data-driven approach.
Strong communication, documentation, and storytelling abilities to convey reliability insights.
Preferred Qualifications
Certification in SRE (Google SRE, CRE), AWS/Azure/GCP Architect, ITIL, or TOGAF.
Experience with AI-driven IT operations (AIOps) and generative AI-based observability.