Role Overview: This role is responsible for troubleshooting and resolving production incidents. It acts as a bridge between the support and development teams: handling technical investigations, applying quick fixes, and escalating critical issues. By resolving incidents effectively, it frees the development team to focus on R&D and feature development.
Key Responsibilities:
Incident Management and Troubleshooting
Take ownership of production incidents, perform deep-dive investigations, and provide immediate resolutions or workarounds
Monitor production alerts, logs, and error notifications in real-time to ensure rapid incident response
Escalate unresolved issues to the development team only when necessary, minimizing their involvement in routine incidents
Document all production issues, resolutions, and lessons learned to improve troubleshooting efficiency
Develop and maintain incident response plans to ensure a structured troubleshooting approach
Collaboration and Support Enablement
Work closely with the support team to assist with technical escalations and ensure customer issues are addressed quickly
Coordinate with the development team to report recurring issues that need long-term fixes while reducing their direct involvement in incident handling
Communicate incident status, impact, and resolution progress to key stakeholders and leadership
System Monitoring and Performance Optimization
Monitor support emails, process failure notification emails, and Prometheus alerts to detect emerging issues early and prevent incidents before they occur
Work with DevOps to improve observability, logging, and alerting strategies
Suggest Workarounds and Implement Quick Fixes
Understand the product and customer use cases to provide workaround solutions when needed
Execute minor SQL queries and data fixes to resolve customer issues without requiring development team intervention
Leadership and Team Management
Lead and mentor a team of junior support engineers, ensuring they follow best practices in incident handling
Train the support team on troubleshooting common production issues
Establish clear ownership of incident response to reduce ad-hoc escalations to the development team
Required Qualifications:
Technical Skills:
5+ years of experience in production support, incident management, or site reliability engineering
Strong expertise in Linux/Unix systems and troubleshooting
Experience with monitoring tools such as ELK Stack, Grafana, Prometheus, and CloudWatch
Proficiency in SQL (MySQL, PostgreSQL, or Oracle) for running queries and applying minor data fixes
Hands-on experience with log analysis and debugging using ELK Stack
Knowledge of scripting languages such as Shell, Python, or Groovy to automate incident handling
Familiarity with microservices, REST APIs, and message queues like RabbitMQ and Kafka
Soft Skills and Leadership:
Strong problem-solving and troubleshooting skills under pressure
Ability to mentor junior engineers and effectively lead small teams
Excellent communication skills for collaboration with engineering, CS, and DevOps teams
Proactive mindset to reduce developer involvement in incident handling and improve overall system reliability