Shrive Technologies

Site Reliability Engineer/SRE

Title: Site Reliability Engineer/SRE

Location: Indiana (Remote)

50% Support/Operations

Runtime production operations support for Sev 0 and Sev 1 incidents
"Super T-shaped" role that can float between squads with a focus on continuous process improvement

50% Development engineering

Automation of repetitive tasks
SREs are focused on building and monitoring anything in production that improves service resiliency

Job Description

Hands-on experience across the stages of the IT infrastructure management lifecycle.
Experience in client relationship management, service integration, team building, and process and people management.
Experience in successfully managing cloud operations and resources to deliver Client Satisfaction.
Experience building, integrating, deploying and provisioning cloud services
IaC: Implemented large-scale infrastructure using ARM, CloudFormation (CF), or Terraform templates
Experienced in scripting languages such as PowerShell, Python and Shell
Experience with configuration management tools (Chef, Puppet, Ansible)
Experience with Collaboration tools such as Atlassian (Jira, Confluence)
Successfully governed DC consolidation and migration Projects
Optimize on-premises and cloud infrastructure, and participate in design reviews
Led multiple implementations of infrastructure monitoring using native and third-party monitoring tools
Capacity planning and management: create, use, and maintain a capacity model for on-prem and cloud workloads
Certified in Cloud Architecture, Operations and Engineering
Certified in ITIL and project management

Responsibilities

Resolve critical and complex technical issues in a global support delivery team. Combine technical expertise and customer requirements to solve complex business challenges.
Quickly identify customer issues with cloud services, and conduct in-depth diagnostics on the cloud platform and services.
Perform RCA of critical incidents. Analyze and eliminate top issues impacting customer experience.
Create documentation (SOPs and TSGs) to help L1/L2 teams support operations.
Work with leadership on process improvement and strategic initiatives
Serve as the SME for selecting technology candidates and self-healing capabilities for future service development
Perform large-scale automation, combining independent processes into robust behavior

Control Points

Provide architectural reviews and sign-offs on a service based on its ability to achieve availability targets
Accept or reject services based on their ability to achieve SLAs
Validate scalability testing results, and test limits of hardware and software