OPENTEXT
OpenText is a global leader in information management, where innovation, creativity, and collaboration are the key components of our corporate culture. As a member of our team, you will have the opportunity to partner with the most highly regarded companies in the world, tackle complex issues, and contribute to projects that shape the future of digital transformation.
Your Impact
The Lead Site Reliability Engineer (SRE) will be responsible for ensuring the availability, reliability, and scalability of cloud infrastructure and services. This role focuses on automation, performance optimization, incident response, and CI/CD pipeline management to support highly available and resilient applications. The ideal candidate will bring deep expertise in AWS, Kubernetes, GitLab CI/CD, and Infrastructure as Code (IaC).
What The Role Offers
- Architect, deploy, and maintain highly available and scalable cloud environments in AWS.
- Design and manage Kubernetes clusters (EKS) and containerized applications with Docker.
- Implement auto-scaling, load balancing, and fault tolerance for cloud services.
- Develop and optimize Infrastructure as Code (IaC) using Terraform, OpenTofu, or Ansible.
- Design, implement, and maintain CI/CD pipelines using GitLab CI/CD and ArgoCD.
- Automate deployment workflows, infrastructure provisioning, and release management.
- Ensure secure, compliant, and automated software delivery across multiple environments.
- Implement observability and monitoring using tools like CloudWatch, Prometheus, Grafana, ELK, or Datadog.
- Analyze system performance, detect anomalies, and optimize cloud resource utilization.
- Drive incident response and root cause analysis, ensuring fast recovery (low MTTR) and minimal downtime.
- Establish Service Level Objectives (SLOs) and error budgets to maintain system health.
- Implement security best practices, including IAM policies, encryption, network security, and vulnerability scanning.
- Automate patch management and security updates for cloud infrastructure.
- Ensure compliance with industry standards and regulations (SOC 2, ISO 27001, HIPAA, etc.).
- Work closely with DevOps, security, and development teams to drive reliability best practices.
- Lead blameless postmortems and continuously improve operational processes.
- Provide mentorship and training to junior engineers on SRE principles and cloud best practices.
- Participate in on-call rotations, ensuring 24/7 reliability of production services.
What You Need To Succeed
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
- 10-12 years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Engineering.
- Expertise in AWS Cloud – Hands-on experience with EC2, VPC, RDS, S3, IAM, Lambda, and EKS.
- Strong Kubernetes knowledge – Hands-on experience with EKS, Helm charts, and cluster management.
- CI/CD experience – Proficiency in GitLab CI/CD and ArgoCD for automated software deployments.
- Infrastructure as Code (IaC) – Experience with Terraform or OpenTofu.
- Monitoring & Logging – Familiarity with CloudWatch, Prometheus, Grafana, ELK, or Datadog.
- Scripting & Automation – Proficiency in Python, Shell scripting, or Golang.
- Incident Management & Reliability Practices – Experience with SLOs, SLIs, error budgets, and chaos engineering.
OpenText's efforts to build an inclusive work environment go beyond simply complying with applicable laws. Our Employment Equity and Diversity Policy provides direction on maintaining a working environment that is inclusive of everyone, regardless of culture, national origin, race, color, gender, gender identification, sexual orientation, family status, age, veteran status, disability, religion, or other basis protected by applicable laws.
If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please contact us at [email protected]. Our proactive approach fosters collaboration, innovation, and personal growth, enriching OpenText's vibrant workplace.
45926