At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve—including our employees, customers, shareholders, Partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.
The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world-class solutions in a fast-paced environment and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us!
About the Role:
As a Senior Staff Site Reliability Engineer, you will play a pivotal role in driving reliability and performance improvements across the SolarWinds Observability Platform. You will work closely with cross-functional engineering teams to manage and reduce SaaS backlogs, ensuring that our platform scales effectively while maintaining the highest standards of reliability and performance. Your ability to drive initiatives, provide technical leadership, and optimize complex systems will be key to our success.
This role demands deep technical expertise, a collaborative mindset, and the ability to mentor a high-performing team of engineers. You will be responsible for driving technical initiatives, overseeing incident response, and improving our platform's infrastructure while focusing on the integration of emerging technologies such as ClickHouse, Kafka, Karpenter, and Buf.
Key Responsibilities:
- Lead and Drive Initiatives: Own and lead strategic initiatives to improve the reliability, scalability, and performance of the SolarWinds Observability Platform, with a strong focus on reducing SaaS backlogs.
- SaaS Backlog Management: Collaborate with cross-functional teams to identify, prioritize, and address outstanding backlog items, including incidents, infrastructure improvements, performance optimization, and automation.
- Automation & Observability: Lead the development of automation strategies and observability tools to improve platform monitoring, reduce incidents, and enhance performance insights across the infrastructure.
- Incident Response & Postmortems: Lead response efforts for production incidents, conduct thorough postmortems, drive continuous improvement initiatives, and ensure the team learns from each incident.
- Platform Engineering Leadership: Drive initiatives related to platform engineering and scale infrastructure systems, ensuring they meet the reliability and performance standards necessary for the SolarWinds Observability Platform.
- Mentorship & Team Leadership: Mentor and provide technical guidance to the Site Reliability Engineering (SRE) team, helping them grow their skills and driving a culture of continuous learning and collaboration.
- Collaboration & Cross-Functional Engagement: Collaborate closely with engineering, security, and product teams to ensure the seamless integration of new technologies and systems, improving platform reliability and scalability.
Ideal Candidate Attributes:
- Strong Leadership Skills: Proven ability to drive initiatives, manage SaaS backlogs, and lead cross-functional teams to successful outcomes.
- Collaborative Mindset: Comfortable working with diverse teams across different functions to solve complex problems and build scalable, high-performance systems.
- Customer-Focused: A strong customer orientation, with the ability to translate technical challenges into business solutions.
- Excellent Communication: Strong interpersonal and communication skills to effectively engage with both technical and non-technical stakeholders.
- Problem-Solving & Ownership: A collaborative problem solver with a strong bias for ownership and decisive action.
Qualifications:
- 13+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles, with extensive experience managing SaaS environments.
- 8+ years of experience designing, building, and maintaining AWS/Azure infrastructure, using Terraform and automation tools.
- 5+ years of experience building, running, and scaling Kubernetes clusters in production environments.
- Experience with observability tools and practices (e.g., monitoring, logging, tracing, and metrics) for high-performance systems.
- Strong expertise with Kafka for real-time data processing, ClickHouse for OLAP workloads, and GitOps CI/CD processes.
- Familiarity with Karpenter for Kubernetes autoscaling and Buf for managing Protocol Buffers at scale is a plus.
- Programming experience in Python, Go (Golang), and Bash.
- Security Operations Experience: Knowledge of security best practices for cloud-native environments, including encryption, key management, and security policies.
- Mentorship Experience: Demonstrated success in mentoring and growing technical teams, fostering a culture of collaboration and continuous learning.
SolarWinds is an Equal Employment Opportunity Employer. SolarWinds will consider all qualified applicants for employment without regard to race, color, religion, sex, age, national origin, sexual orientation, gender identity, marital status, disability, veteran status or any other characteristic protected by law.
All applications are treated in accordance with the SolarWinds Privacy Notice: https://www.solarwinds.com/applicant-privacy-notice