Here at OCI we’re building the world’s largest AI clusters, and we’re the fastest at bringing them to market. The AI Infrastructure organization at OCI is leading this effort. We are looking for a highly skilled and motivated Site Reliability Engineer to join our team. In this role you will be central to ensuring the reliability, scalability, and performance of our systems and infrastructure, collaborating with cross-functional teams to optimize our technology operations and deliver an exceptional customer experience. You will have the opportunity to work with cutting-edge technologies and make a significant impact on our organization's success.
Career Level - IC3
Responsibilities
Monitor and maintain the health, performance, and availability of our systems and infrastructure.
Collaborate with development and operations teams to define and implement best practices for system reliability and performance.
Develop and maintain proactive monitoring, alerting, and incident response systems to detect and resolve issues quickly.
Participate in the rapid delivery of GPU servers and improve their availability and quality.
Troubleshoot and resolve complex issues related to system performance, scalability, and reliability.
Automate repetitive tasks and develop tools to improve operational efficiency and productivity.
Collaborate with cross-functional teams to plan and execute infrastructure changes, upgrades, and deployments.
Conduct post-incident analysis and identify areas for improvement to prevent future incidents.
Stay up to date with industry trends and emerging technologies in site reliability engineering and apply them to improve our systems and processes.
Document system configurations, processes, and procedures to ensure knowledge sharing and maintain system integrity.
Requirements
BS (or equivalent experience) in Computer Science, Engineering, or a related field.
3-5 years of experience as a Site Reliability Engineer or in a similar role, with a focus on system reliability, performance, and scalability.
A systematic approach to problem solving, a strong sense of ownership, and drive.
Strong knowledge of cloud infrastructure, distributed systems, and network architecture.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with automation and configuration management tools like Terraform, Ansible, or Chef.
Familiarity with monitoring and alerting tools such as Prometheus or Grafana.
Strong problem-solving and troubleshooting skills, with the ability to analyze complex systems and identify areas for improvement.
Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.
Ability to adapt to a fast-paced, dynamic environment and manage multiple tasks and priorities effectively.
Preferred Qualifications
Experience with NVIDIA training technologies (CUDA, NCCL).
Working familiarity with networking protocols (TCP/IP, UDP, HTTP) and standard network architectures.
Strong technical knowledge of distributed systems, high-performance computing, and GPU systems.
Experience with AI model training infrastructure.