Your Impact:
We are seeking a Staff Site Reliability Engineer (Infrastructure & Site Reliability Engineering) with extensive experience in AWS, Azure, Kubernetes, and GitOps to lead our Site Reliability Engineering (SRE) team. The successful candidate will have a deep understanding of SRE and a track record of implementing high-quality SRE practices (SLAs, SLOs, proactive alert management, incident response and review, postmortems, etc.).
In this role, you will work with our SRE and cross-functional engineering teams to build and operate our development and production infrastructure.
Your Role
* Work collaboratively with software engineering on infrastructure and deployment requirements
* Contribute actively to our automation and observability initiatives
* Build and maintain operational tools for deployment, monitoring, and analysis of cloud (AWS & Azure) infrastructure and systems
* Participate in on-call rotations: respond to production incidents alongside senior team members, contribute actively to postmortems, and engage in continuous improvement efforts, gaining exposure to critical issue resolution
* Establish and drive operations performance through SLOs
* Provide project management, sprint planning, and road-mapping support to the SRE team
* Apply expert-level technical skills and mentor team members
* Follow the practices our team uses to maximize development velocity, including but not limited to continuous integration/deployment and code review via GitHub pull requests
Your Experience
* Strong customer orientation
* Excellent interpersonal and organizational skills
* Attention to detail and focus on quality
* Strong communication skills to effectively liaise with both technical and non-technical staff
* Ability to act decisively and work well under pressure
* Must be a collaborative problem solver
* Strong bias for ownership and action
Qualifications:
* 10+ years of experience designing, building, and maintaining SaaS environments
* 6+ years of experience designing, building, and maintaining AWS/Azure infrastructure with Terraform
* 3+ years of experience building and running Kubernetes, ClickHouse, MySQL, and Kafka clusters
* Experience with observability and monitoring (logging, tracing, metrics)
* Experience with GitOps CI/CD processes
* Experience scripting in Python, Go (Golang), Bash, or PowerShell, and with AWS CLI tools
* Experience with security operations: security policies, infrastructure, key management, and setup of encryption at rest and in transit