VidPro Consultancy Services

Senior Engineering Leader - Site Reliability

West Bengal, IN

about 1 month ago

Summary

Key Responsibilities

  • Strategic Leadership & Consulting:
  • Define and implement SRE strategies aligned with business and technology objectives.
  • Act as a trusted advisor to executive leadership, influencing reliability, observability, and automation initiatives.
  • Collaborate with engineering, cloud, DevOps, security, and platform teams to drive reliability and resilience roadmaps.
  • Conduct reliability assessments, risk analysis, and gap identification for continuous service improvement.
  • Lead the adoption of SRE culture across the organization, evangelizing reliability engineering principles.
  • Site Reliability & Observability Architecture:
  • Architect and implement scalable observability solutions including APM, logs, traces, and metrics (e.g., Prometheus, Grafana, Datadog, New Relic, Splunk).
  • Develop a unified monitoring and alerting framework that integrates real-time insights and automated response mechanisms.
  • Establish and refine SLOs, SLIs, and error budgets to enhance service reliability.
  • Optimize incident management and root cause analysis using AI-driven observability and predictive analytics.
  • Incident & Service Management:
  • Define and implement best practices for incident response, post-mortems, and problem resolution.
  • Improve MTTR (Mean Time to Repair) and MTTF (Mean Time to Failure) through proactive automation and analytics-driven insights.
  • Develop robust escalation and alerting policies that reduce alert noise and improve the signal-to-noise ratio in monitoring.
  • Drive RPA (Robotic Process Automation) and workflow automation to eliminate repetitive, manual operational tasks.
  • Toil Elimination & Automation:
  • Identify and eliminate operational toil using self-healing infrastructure, runbook automation, and auto-remediation workflows.
  • Champion AI/ML-based predictive analytics for anomaly detection, capacity planning, and proactive incident prevention.
  • Develop CI/CD-driven operational automation for reducing manual interventions in deployments and rollbacks.
  • Build and lead initiatives in AIOps, ChatOps, and ITSM automation to streamline support operations.
  • Talent Development & Technical Leadership:
  • Mentor, coach, and grow high-performing SRE teams, fostering a culture of innovation and continuous learning.
  • Drive SRE training programs, workshops, and certifications to upskill engineers on modern reliability practices.
  • Establish and promote career development frameworks for SREs at different levels.
  • Cultivate an environment of psychological safety, collaboration, and shared responsibility for reliability.
  • Governance, Compliance & Cost Optimization:
  • Ensure governance and compliance with regulatory requirements, standards, and frameworks (e.g., ISO 27001, SOC 2, NIST, ITIL).
  • Optimize cloud cost efficiency through effective capacity planning, autoscaling, and FinOps principles.
  • Define policies for resilience engineering, chaos engineering experiments, and disaster recovery planning.
  • Work closely with InfoSec teams to implement security monitoring and threat detection capabilities.

Required Qualifications & Experience

Technical Expertise:

  • 15+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
  • Expertise in monitoring & observability tools like Datadog, Prometheus, Grafana, New Relic, AppDynamics, Splunk, OpenTelemetry.
  • Hands-on experience with incident management, service management (ITIL), and automation platforms.
  • Strong background in toil elimination and workflow automation using RPA, AIOps, and event-driven automation.
  • Proficiency in programming (Python, Go, Java, or similar) for scripting and automation.
  • Experience with Kubernetes, Service Mesh (Istio, Linkerd), and cloud-native architectures.
  • Deep understanding of SLOs, SLIs, error budgets, and reliability engineering principles.
  • Expertise in Cloud Platforms (AWS, Azure, GCP), including serverless, containerization, and networking.
  • Strong understanding of predictive analytics, AI/ML for anomaly detection, and self-healing systems.

Leadership & Consulting Skills

  • Proven experience leading large-scale SRE teams and driving enterprise-wide reliability initiatives.
  • Ability to influence executive stakeholders and drive strategic decision-making.
  • Experience mentoring, coaching, and developing engineering talent.
  • Exceptional problem-solving and incident management skills with a data-driven approach.
  • Strong communication, documentation, and storytelling abilities to convey reliability insights.

Preferred Qualifications

  • Certification in SRE (Google SRE, CRE), AWS/Azure/GCP Architect, ITIL, or TOGAF.
  • Experience with AI-driven IT operations (AIOps) and generative AI-based observability.
  • Hands-on expertise in workflow orchestration tools (Airflow, Argo Workflows, Camunda, ServiceNow).
  • Familiarity with SRE in highly regulated industries such as finance, healthcare, or telecom.
  • Strong background in distributed systems, microservices, and API reliability engineering.

(ref:hirist.tech)
