Job Description:
We seek a talented SRE/DevOps Engineer to drive reliability, scalability, and performance of critical systems in Azure Core, specifically focusing on Office 365 buildouts. The ideal candidate will have hands-on experience in cloud infrastructure, automation, monitoring, and incident management. You will be responsible for deploying resources in public and sovereign clouds, troubleshooting complex system issues, and working with large datasets to generate operational insights.
Responsibilities:
· Participate in on-call rotations, responding to production incidents during non-business hours, weekends, and holidays as needed. Manage and resolve system incidents by leading incident bridges, troubleshooting, and driving resolution.
· Continuously monitor system performance using telemetry tools to identify and resolve potential issues before they impact service reliability. Keep all performance metrics within acceptable limits and drive toward KPI targets.
· Maintain automation tools that reduce manual effort and increase reliability.
· Lead and execute buildouts, ensuring timely delivery and troubleshooting deployment issues.
· Analyze operational data, create dashboards, and report on system chokepoints, throughput, and performance. Identify areas for cycle time reduction and incident toil minimization.
· Lead blameless post-incident reviews (postmortems) to determine root cause and improve service resiliency. Implement preventive measures to avoid repeat issues.
· Create and maintain comprehensive documentation, including technical procedures, playbooks, and troubleshooting guides (TSGs), to streamline incident response and improve operational knowledge sharing.
Requirements:
· Bachelor’s degree in computer science or a related technical discipline.
· Strong experience with Azure DevOps, including subscription management, Azure Portal, CLI, and deploying Azure Virtual Machines.