We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with cross-functional teams to create telemetry solutions that provide actionable insights into our distributed systems.
Observability Platform Design: Architect, implement, and maintain observability systems using tools such as Datadog, the LGTM stack (Loki, Grafana, Tempo, Mimir), OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting.
Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments.
Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements.
Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity.
Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data.
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization.
Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle.
Educational Background: Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
Experience: 5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability.
Technical Expertise:
Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog.
Experience with OpenTelemetry and distributed tracing in microservices architectures.
Deep knowledge of Kubernetes (e.g., EKS), Argo CD, and Crossplane.
Programming Skills: Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions.
Cloud & Systems: Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals.
Soft Skills: Exceptional problem-solving and communication skills, and a data-driven approach to decision-making.