Sr Cloud Reliability Engineer, Platform Engineering
Amsterdam, NH, NL
23 days ago
Summary
About The Role
Platform Engineers at Uber blend software and systems engineering capabilities to ensure Uber's services run reliably and can scale to meet our significant needs. Platform Engineering holds a high bar for reliability, efficiency, and resilience across Uber's Tech platforms, ensuring production systems maintain the highest levels of availability.
We are seeking a highly skilled Cloud Reliability Engineer to join our global team. This role will be part of an evolving team and set of internal processes focused on overseeing service reliability across multiple cloud providers, ensuring service levels are met, and driving incident response best practices. You will work closely with internal engineering teams and cloud partners to proactively detect and manage incidents and drive root cause analysis, while continuously improving our processes.
What You Will Do
Incident Management & Response: Lead cloud incident management efforts, ensuring rapid detection, triage, and resolution across all cloud platforms.
Root Cause Analysis & SLA Compliance: Evolve key processes to ensure cloud incident RCAs are completed within the agreed Service Level Agreements (SLAs), track all action items, and drive continuous improvement in cloud reliability.
Monitoring & Automation: Unify automated monitoring, alerting mechanisms, and centralized incident logging to improve detection and response times.
Reporting & Insights: Develop targeted reporting to provide directly relevant cloud reliability insights.
Continuous Improvement: Identify patterns in incidents, optimize response playbooks, and enhance incident management frameworks for ongoing operational resilience.
Basic Qualifications
5+ years of experience in cloud incident management, SRE, or operations
Expertise in multi-cloud environments
Experience with incident detection, response, and RCA processes
Strong analytical and problem-solving skills, with the ability to work under pressure.
Excellent communication and stakeholder management skills.
Preferred Qualifications
Certifications in cloud platforms
Hands-on experience with incident escalation procedures and service recovery plans.
Experience with automated logging and forensic analysis tools
Familiarity with SLAs, compliance, and audit processes
Prior experience working in a highly scalable global organization