Persistent Systems

Site Reliability Engineer

Bengaluru, KA, IN

6 days ago

Summary

About the role:

The engineer should aim to maximize system availability through proactive means. Candidates should build robust automation to eliminate or minimize incidents, and investigate and resolve production incidents when they occur. They should also perform trend analysis on production issues, fix bugs in and enhance the scripts used for operational tasks, and be comfortable working with development and scrum teams, supporting their demanding needs to ensure the availability and performance of the applications and platform.


The candidate must have 8+ years of relevant industry experience and a proven track record of working on large-scale, global SRE or DevOps implementation projects, along with application development experience.


In this role, you will:


• Drive innovation in Digital Technology & Innovation application portfolios and increase efficiency through automation, SRE, and Agile practices, with an emphasis on enhancing the end-user experience.

• Lead the team on all technical issues related to the app and web tiers.

• Serve as the expert in middleware administration (WebLogic, Tomcat, Apache, IIS), drawing on strong working experience in production support of middleware applications.

• Drive automation of manual, repetitive operational tasks and engineer solutions to automate production game plans.

• Perform trend analysis of repetitive production issues and engage relevant operation/development teams to address the failure patterns and incidents.

• Drive adoption of self-healing and resiliency patterns.

• Enhance end-to-end application and system observability by improving the alarm setup or developing new dashboards using monitoring, log analysis, and analytics tools such as Splunk, AppDynamics, Elasticsearch, Power BI, and Tableau.

• Work closely with the enterprise SRE team to perform SRE maturity assessments for in-scope applications, baseline current-state metrics, establish SLIs/SLOs, error budgets, service levels, and monitoring, alerting, and recovery objectives, and perform periodic resiliency testing for all in-scope applications.

• Manage the toil registry created for the group and reduce toil by fine-tuning the existing monitoring and alarming setup or by developing tools to automate routine tasks using Ansible, shell scripting, etc.

• Develop self-healing solutions for alarms to help reduce production incidents.

• Enhance and fix bugs in the existing patching and production release install scripts to improve their success ratio, and own or participate in root cause analysis using the 5 Whys approach.

• Recommend infrastructure-level solutions by proactively analyzing low-level errors in application logs that would otherwise go undetected, in order to enhance the customer experience.

• Direct large-scale projects and application implementations from proof of concept through testing and installation.

• Troubleshoot high-severity production incidents in real time, and improve system availability and reliability by facilitating blameless postmortems to prevent recurrence.

• Apply analytics to historical monitoring or incident data to predict issues and take proactive action.

• Gather and analyze statistics to assist architecture, engineering, and development teams with capacity planning to support projected transaction volumes, response times, and system availability targets.

• Collaborate with enterprise partners on issues and initiatives that impact the infrastructure.

• Add value to team delivery, work with the team to complete tasks to a high standard, and actively learn new skills and technologies.


Required Qualifications:

• Bachelor’s Degree or equivalent experience in any software engineering discipline.

• 8+ years' experience in production support and SRE implementation in a large-scale environment (preferably in the banking domain).

• Hands-on experience with web and middleware platforms (Apache, Tomcat, WebLogic, OpenShift, etc.) in Linux/Windows environments.

• Hands-on experience supporting OpenShift applications and microservice architecture-based applications.

• Hands-on experience with monitoring, log analysis, and dashboard tools such as AppDynamics, Splunk, Elasticsearch, Netcool, Power BI, and Tableau.

• Proficiency in shell scripting, Ansible, and one programming language such as Python or JavaScript.

• Good knowledge of DevOps tools (GitHub, Jenkins, UCD) and cloud platforms such as GCP.

• Knowledge of database and network environments.

• Good knowledge of the Agile and ITIL frameworks.


Desired Qualifications:

• Experience in the Unix/Linux server support domain.

• Cloud certification.

• Experience with Tableau or similar BI tools.

• Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field.


Job expectations:

• Strong analytical and problem-solving abilities, with quick adaptation to new technologies, methodologies, and systems.

• Strong written and oral communication skills, documentation skills, and the ability to work independently.

• Self-learner who understands the technology environment and delivers quickly.

• A proactive, hands-on approach and strong system and analytical skills, with a focus on streamlining operational tasks through automation.

• Willing to work in shifts (24x7 model).
