This job has expired

Rivian

Staff Site Reliability Engineer

Palo Alto, CA

25 days ago

Summary

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our team. In this critical role, you will be instrumental in ensuring the reliability, performance, and scalability of our complex distributed systems. You will act as a central point of expertise, focusing on incident coordination, providing a crucial backstop for service owners, and proactively engaging with development teams to enhance service maturity. A strong understanding of our distributed architecture and data flows, together with the ability to design innovative technology patterns that address challenges at scale, is essential for success in this role.

Responsibilities

Incident Coordination: Lead and manage the end-to-end process for major incidents, ensuring effective communication, clear roles and responsibilities, and timely resolution. Drive post-incident reviews to identify root causes and implement preventative measures.
On-Call Backstop: Serve as an escalation point and provide expert support during on-call rotations for service owners, particularly for complex or systemic issues.
Service Maturity Engagement: Proactively engage with development teams on short-term projects to improve the reliability, observability, and operational excellence of their services. This includes guidance on SLO/SLI definition, error budgeting, monitoring strategies, and automation.
Distributed System Architecture Expertise: Maintain a comprehensive and up-to-date mental model of RVT's distributed system architecture, including inter-service dependencies, data flows, and critical infrastructure components.
Technology Pattern Design: Identify recurring challenges and design new, scalable technology patterns and best practices to improve the reliability, efficiency, and resilience of our systems. This may involve exploring new technologies and advocating for their adoption.
Performance Optimization: Analyze system performance, identify bottlenecks, and collaborate with development teams to implement optimizations for latency, throughput, and resource utilization.
Platform Conditioning: Contribute to capacity planning efforts by analyzing trends, defining saturation points of the system, and recommending scaling strategies.
Mentorship: Mentor and guide junior SRE team members, fostering a culture of learning and knowledge sharing.

Qualifications

Deep understanding of modern distributed systems principles, including microservices, Kubernetes, and cloud-native architectures.
Experience working with cell-based architectures and managing IoT environments at scale, including understanding the unique challenges and considerations of these systems.
Software development experience in modern programming languages such as Rust and Golang, with a strong understanding of the software development lifecycle and best practices.
Extensive experience in designing and coordinating incident response plans and processes at scale in large cloud environments (e.g., AWS, Azure, GCP).
Strong knowledge of data platforms, including relational and NoSQL databases, data warehousing concepts, and data governance.
Experience with real-time streaming technologies (e.g., Kafka, Flink) and data lake architectures (e.g., Amazon S3, Azure Data Lake Storage (ADLS)).
Familiarity with global traffic management techniques (e.g., DNS-based routing, load balancing strategies, CDN).
Proficiency with observability tools and practices, including monitoring, logging, tracing, and alerting (e.g., Prometheus, Grafana, ELK stack, Datadog).
Excellent troubleshooting and analytical skills, with the ability to diagnose complex issues in distributed environments.
Strong communication and collaboration skills, with the ability to effectively communicate technical concepts to both technical and non-technical audiences.
Experience with infrastructure-as-code (IaC) tools like Crossplane or Terraform.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
