Vallum Associates

Senior Site Reliability Engineer

Atlanta, GA, US

2 days ago

Summary

Job Description:


We are seeking a Senior Site Reliability Engineer (SRE) to join a high-performing team focused on ensuring the reliability, availability, and performance of enterprise systems. The ideal candidate will bring deep experience with Microsoft Azure cloud infrastructure and hands-on expertise in Dynatrace monitoring and observability tools. In this hybrid role based in Atlanta, GA, you will work closely with development and operations teams to proactively manage system health, implement observability best practices, and drive continuous improvements in service stability and reliability.


Key Responsibilities:

  • Monitor, maintain, and enhance system performance, availability, and scalability across cloud infrastructure.
  • Design, implement, and manage observability solutions using Dynatrace, enabling end-to-end monitoring of cloud-native applications and services.
  • Define, track, and enforce SLOs (Service Level Objectives) and SLIs (Service Level Indicators) that align with business requirements and user expectations.
  • Collaborate with software development and DevOps teams to implement reliability and performance improvements through automation and proactive monitoring.
  • Participate in incident response, root cause analysis, and continuous improvement efforts.
  • Provide guidance on best practices for system instrumentation, logging, and telemetry in cloud environments.


Required Skills & Qualifications:

  • 5+ years of experience in Site Reliability Engineering (SRE) or a related DevOps/Operations role.
  • Strong hands-on experience with Microsoft Azure cloud services and infrastructure management.
  • Proven expertise in implementing and managing monitoring and observability platforms, with a strong focus on Dynatrace.
  • Solid understanding of cloud-native architectures, containerized environments, and modern CI/CD pipelines.
  • Familiarity with incident management, postmortem analysis, and operational readiness practices.
  • Experience defining and managing SLOs/SLIs and using data-driven approaches to improve system reliability.
  • Excellent problem-solving skills, communication abilities, and a collaborative mindset.


Nice to Have:

  • Experience with infrastructure as code (e.g., Terraform, ARM templates).
  • Knowledge of additional monitoring tools (e.g., Prometheus, Grafana) and scripting languages (e.g., Python, PowerShell).
