Mandatory Skills: Strong experience in SRE, production support, incident management, AWS, Python, and shell scripting
Job Description:
What you'll do:
Contribute to all aspects of the production environment for all merchant loyalty use cases
Contribute to strategies for all facets of observability
Identify areas of improvement in production
Ability to understand MTTR, SLO, SLI definitions and apply them to services.
Respond to incidents and own and drive the incident manager role during active CIs
Keep mitigation and resolution efforts on task by asking for updates and contributing data or investigation when appropriate
Provide progress summaries and comms suggestions to Support within SLAs to enable effective customer comms during CIs
Contribute to reliable, fault-tolerant, efficiently scalable and cost-effective services and infrastructure
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Practice sustainable incident response and blameless postmortems
Ability to create and run queries against big data platforms and relational tables to identify process issues or perform mass updates (preferred)
Ability to isolate problems between hardware and software
Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns
Support services before they go live through activities such as system design consulting, capacity planning and launch reviews
Execute sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
Work with a global team spread across tech hubs in multiple geographies and time zones
What experience you need:
Experience in Splunk and SignalFx
Experience with Amazon Web Services including RDS
Relevant data DevOps, SRE, or general systems engineering experience
Experience in managing large production platforms
Experience architecting and implementing data governance processes and tooling (data catalogs, lineage tools, role-based access control, PII handling)
Strong coding ability in Python or another language such as Java, C#, Golang, C, C++, Perl or Ruby
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
Ability to help debug and optimize code and automate routine tasks
Ability to support many different stakeholders. Experience in dealing with difficult situations and making decisions with a sense of urgency is needed
Interest in designing, analyzing and troubleshooting large-scale distributed systems
Appetite for change and pushing the boundaries of what can be done with automation
Experience working across development, operations, and product teams to prioritize needs and build relationships is a must
Good handle on the change management and release management aspects of software