Uber

Sr Cloud Reliability Engineer, Platform Engineering

Amsterdam, NH, NL

23 days ago

Summary

About The Role

Platform Engineers at Uber blend software and systems engineering capabilities to ensure Uber's services run reliably and can scale to meet our significant needs. Platform Engineering holds a high bar for reliability, efficiency, and resilience across Uber's Tech platforms, ensuring production systems maintain the highest levels of availability.

We are seeking a highly skilled Cloud Reliability Engineer to join our global team. This role will be part of an evolving team and set of internal processes that focus on overseeing service reliability across multiple cloud providers, ensuring service levels are met, and driving incident response best practices. You will work closely with internal engineering teams and cloud partners to proactively detect and manage incidents and drive root cause analysis, while continuously improving our processes.

What You Will Do

  • Incident Management & Response: Lead cloud incident management efforts, ensuring rapid detection, triage, and resolution across all cloud platforms.
  • Root Cause Analysis & SLA Compliance: Evolve key processes to ensure cloud incident RCAs are completed within the agreed Service Level Agreements (SLAs), track all action items, and drive continuous improvement in cloud reliability.
  • Monitoring & Automation: Unify automated monitoring, alerting mechanisms, and centralized incident logging to improve detection and response times.
  • Reporting & Insights: Develop targeted reporting to provide directly relevant cloud reliability insights.
  • Continuous Improvement: Identify patterns in incidents, optimize response playbooks, and enhance incident management frameworks for ongoing operational resilience.

Basic Qualifications

  • 5+ years of experience in cloud incident management, SRE, or operations
  • Expertise in multi-cloud environments
  • Experience with incident detection, response, and RCA processes
  • Strong analytical and problem-solving skills, with the ability to work under pressure
  • Excellent communication and stakeholder management skills

Preferred Qualifications

  • Certifications in cloud platforms
  • Hands-on experience with incident escalation procedures and service recovery plans
  • Experience with automated logging and forensic analysis tools
  • Familiarity with SLAs, compliance, and audit processes
  • Prior experience working in a highly scalable global organization
