Orion Health

Senior Site Reliability Engineer

Toronto, ON, CA

13 days ago


Summary

Innovate With Purpose

Do you want to work for a company that is innovating and making a difference to the health and wellbeing of people all over the world? We’re not about selling meaningless, unnecessary products for corporate profitability. You’ll be working on technology that will revolutionise global health systems so that we can finally get the healthcare we all want - a basic human right.

We like to think of ourselves as a community of start-ups where you can be your true, genuine self. Each of our product teams has the autonomy to decide how they operate and contribute towards our mission of providing each person with the right care at the right time and in the right place.

Orion Health is excited to be expanding our galaxy by recruiting a number of stellar individuals to join our team and help us deliver to our global customer base. If you want to climb aboard the rocketship and help us revolutionise global health systems, astronomical opportunities await.

Position Purpose:

Collaborate in building the automation for infrastructure and software delivery, act as the primary executor of those processes, and collect feedback from the support of operational sites. Responsible for availability, latency, reliability, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Success in this Role looks like…

Through a proactive approach, relentless improvement and constant training, the SREs run the customer environments by monitoring availability and taking a holistic view of system health
SLAs are always met through automation with little to no involvement from the team, and the number of customers and provided services can scale independently of the size of the team
Bridge the gap between development and operations
Well-built software and systems to manage platform infrastructure and applications
Measured and optimised system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve

Business Unit:

North American Managed Services

This unit contributes to Orion Health’s purpose of enabling client success by introducing and maintaining managed environments, policies and procedures in line with ITIL-aligned standards, and by maintaining focus on all elements of support for our customers

Key Relationships:

Internal

Technical Operations Leads, Implementation Consultants, Solution Architects, Service Management Lead, Service Operations, Product Team, and Database Administrators
SREs must have constant communication with the Development Team and Technical Leads (Software Designers) to understand the concrete requirements of the products and the configuration required to be part of the automation

External

Clients and third-party vendors

Essential functions:

Operations Support and Issue Resolution

Participate in the daily management of multiple Orion Health solutions hosted in AWS Cloud, along with their infrastructure and networking, including but not limited to:

Daily monitoring and alert responses, identify potential problems, and implement alerts to notify relevant parties
Following a Change Request from creation to completion, providing review, detail validation and execution of all tasks
Work with other teams to ensure smooth and reliable releases
Tuning the Application stack to improve stability and resultant uptime metrics
Automate repetitive tasks, such as development, scaling, and patching, to improve efficiency and reduce manual effort.
Acute and Recurring issue investigation and resolution
Performance Trend Analysis, identifying and addressing performance bottlenecks to ensure the system can handle expected loads and user traffic
Log Analysis and Error resolution
Manage and maintain the underlying infrastructure, including servers and networks, to ensure smooth operations
Handover Testing
Document procedures and processes to facilitate learning and knowledge transfer within the team
Root Cause Analysis; involved in investigating and resolving incidents, including outages and performance problems, to minimize disruption
Plan for future capacity needs to ensure systems can handle increasing demand
Developing and testing disaster recovery plans to guarantee data integrity, system resilience, and swift restoration of services in case of critical incidents.
Coordinate with teams to maintain Service Level Agreements

Internal Development

Responsible for the Continuous Integration of updates for over 10 Products/solutions released by Development teams into Orion Health solutions.

Build secure and scalable infrastructure to manage customer data

Internal Support

Participate in On-Call Rotation
Work with Development, Solution Adoption, Managed Services, Professional Services, Support and other teams to provide clients with a world-class stable solution platform

Behavioural And Technical Capabilities

Highly proactive and motivated Software/System Engineer, always seeking opportunities for improvement and taking ownership of the challenges
Strong understanding of software engineering principles, operating systems (Windows and Linux), networking, and cloud technologies
Experienced in Windows and Linux OS administration, with hands-on exposure to DataCenter operations
Proficient in Active Directory, Group Policy Object (GPO) management, DNS, and Active Directory service health monitoring
Demonstrated scripting and automation experience using PowerShell, Python, Bash, and other languages
Familiarity with infrastructure automation tools such as Puppet and Ansible is a plus
Capable of communicating ideas and collaborating productively across technical teams
Committed to continuous learning and knowledge sharing within the team
Ability to design secure distributed web services and manage network security at scale
Solid understanding of TCP/IP, DNS, DHCP, VLANs, VPNs, firewall configuration, Load Balancers, and other network appliances

Relevant Experience

4–6 years in a Site Reliability Engineering or equivalent role
5 years in systems/application support and/or development
Strong scripting background with experience in object-oriented and structured programming
Experience with automation, infrastructure as code, and orchestration (e.g., Puppet, Ansible, Kubernetes, CloudFormation, Terraform)
Exposure to on-prem to AWS cloud migration projects and Red Hat OS upgrades is an asset
Working knowledge of Splunk monitoring tools and strategies
Strong foundation in Network Architecture and Security
Experience with CI/CD pipelines and deployment automation in cloud environments (AWS preferred)

Education & Qualifications:

Essential

Bachelor’s Degree in a technical discipline or equivalent experience
Experience in supporting cloud-based production systems
A technical certification in System Administration, Cloud Engineering, or DevOps

Desirable

Formal training and certification in *nix scripting, SQL and non-SQL databases, Oracle databases, Big Data technologies, and AWS Cloud Services is a plus
HIPAA or HITRUST understanding
