1. Technical Expertise
* Deep understanding of SRE principles, the SRE operating model, and DevOps methodologies.
* Experience designing highly available, scalable, and resilient distributed systems.
* Proficient in architectural design (Microservices, Cloud-native, Event-driven architecture).
* Skilled in cloud platforms: Azure, GCP.
* Strong knowledge of observability tools: UIM, Prometheus, Grafana, Datadog, New Relic, Splunk, AppDynamics.
2. Framework Design & Governance
* Define and validate SLOs, SLIs, SLAs, error budgets, and availability targets.
* Design runbooks, escalation policies, and chaos testing frameworks.
* Create reusable templates for observability, alerting, and logging.
* Ensure compliance and audit readiness.
3. Communication & Cross-Functional Leadership
* Collaborate with architects, designers, and platform and infrastructure teams.
* Document frameworks and lead adoption across teams.
* Review designs and validate reliability criteria.
Roles & Responsibilities:
1. Framework & Standardization
* Define and maintain the SRE operating model, framework, and onboarding guide.
* Create templates and reference architectures for observability, alerting, and runbooks.
* Standardize definitions of availability, reliability, latency, and performance.
2. Architectural Integration
* Participate in application architecture reviews to validate SRE compliance.
* Recommend design patterns for fault tolerance, failover, auto-scaling, and DR.
* Define observability-by-design principles.
3. Governance, Audit & Optimization
* Establish and lead SRE councils or review boards.
* Define SRE maturity models, scorecards, and compliance checks.
* Perform SRE audits across product portfolios.
* Guide teams on capacity modeling, load distribution, and cost-efficiency strategies.
* Collaborate with platform teams on resource reservations and right-sizing.
4. Tool Rationalization & Strategy
* Evaluate and recommend standard SRE toolchains for monitoring, logging, and tracing.
* Own the integration strategy across observability platforms.
5. Training, Leadership & Evangelism
* Conduct SRE bootcamps for application and infra teams.
* Champion a blameless culture and continuous improvement mindset.
* Drive error budget policies and reliability trade-off discussions.
* Mentor product teams on SRE integration strategies.
* Influence architectural decisions with SRE perspectives.
Salary Range: $110,000-$125,000 a year