YTL AI Cloud

DC System Operations Manager

Federal Territory of Kuala Lumpur, MY

6 days ago

Summary

Role Description

YTL AI Cloud is seeking a Data Center System Operations Manager to lead and oversee the operations of our cutting-edge GPU data centers in Johor, Malaysia. This role is critical to ensuring the seamless functionality, scalability, and performance of the infrastructure that supports high-demand AI workloads.


Key Responsibilities

  • Lead a team of system operations engineers to ensure smooth and reliable data center operations.
  • Mentor, train, and manage the performance of team members to meet organizational goals.
  • Manage daily operations of GPU clusters, ensuring system health, uptime, and performance.
  • Develop and enforce standard operating procedures (SOPs) for data center operations and incident management.
  • Ensure timely resolution of system issues and maintain SLA compliance.
  • Optimize workflows, including hardware provisioning, monitoring, and scaling GPU resources.
  • Collaborate with cross-functional teams, including network engineers, system administrators, software developers, and project managers, to support AI workload requirements.
  • Manage the deployment, configuration, and optimization of GPU servers, network devices, and supporting infrastructure (e.g., CPU servers and storage).
  • Establish clear operational objectives and KPIs, and ensure accountability across the team.
  • Generate detailed operational reports, including incident analysis and recommendations for improvement.


Qualifications

  • Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent experience will be considered. A master’s degree is preferred.
  • 5+ years of strong experience in data center operations, system administration, or a similar role, with at least 2 years in a managerial role.
  • Strong knowledge of server hardware, including GPU cards, CPU configurations, and storage solutions.
  • Understanding of Linux fundamentals and Kubernetes environments.
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana) and logging frameworks.


Desired Skills

  • Familiarity with monitoring and automation tools.
  • Strong experience with storage systems (e.g., NVMe, SAN, NAS), networking concepts, and protocols (e.g., TCP/IP, RDMA) will be advantageous.
  • Knowledge of ticketing systems and troubleshooting processes for CPU/GPU clusters.
  • Strong knowledge of GPU hardware (e.g., NVIDIA GPUs), server architecture, and storage solutions.
  • Familiarity with networking concepts, including TCP/IP, VLANs, and load balancing. 
  • Experience managing bare-metal servers, GPU clusters, AI/ML infrastructure, or high-performance computing systems is a strong advantage.
  • Understanding of containerization and orchestration tools (e.g., Docker, Kubernetes) is a plus.

