GMI Cloud

Head of AI Infrastructure Engineering

Mountain View, CA, US

5 days ago

Summary

About GMI Cloud: GMI Cloud is a pioneering AI cloud infrastructure company dedicated to providing cutting-edge solutions that accelerate the development and deployment of artificial intelligence. We are building a world-class platform that empowers organizations to tackle their most complex AI challenges. We are seeking a visionary and experienced Head of AI Infrastructure Engineering to lead our engineering efforts in this critical area.



About the Role: The Head of AI Infrastructure Engineering will be responsible for the strategic direction, design, implementation, and operation of GMI Cloud's AI infrastructure. This leader will drive innovation and efficiency in our systems, ensuring they meet the demanding performance, scalability, security, and reliability requirements of modern AI workloads. The ideal candidate will possess deep expertise in distributed systems, cloud technologies, and high-performance computing (HPC), coupled with a proven ability to lead and inspire engineering teams in a fast-paced, dynamic environment.



Responsibilities:


  • Strategic Leadership: Develop and execute the long-term technology strategy for GMI Cloud's AI infrastructure, aligning with overall business objectives and anticipating future AI workload demands, with a strong focus on scalability, cost-effectiveness, and performance optimization.
  • Infrastructure Architecture: Oversee the design and architecture of scalable, reliable, and secure AI infrastructure, encompassing compute resources (CPU, GPU, specialized accelerators, and bare metal), storage systems (object, block, file, distributed storage), and networking (high-speed interconnects, InfiniBand, software-defined networking).
  • AI-Specific Infrastructure: Lead the design and optimization of infrastructure to support various AI/ML workloads, including distributed training, model serving, and large-scale data processing, demonstrating expertise in GPU resource management, model parallelism, data locality, and efficient data pipelines.
  • Technology Evaluation and Adoption: Evaluate and recommend the adoption of new technologies and tools to optimize AI infrastructure performance and efficiency, such as advanced accelerators and other hardware acceleration technologies, high-speed interconnects (InfiniBand, RoCE), containerization (Kubernetes), and serverless computing.
  • Engineering Management: Lead and mentor a team of AI infrastructure and DevOps engineers, fostering a culture of innovation, collaboration, continuous learning, ownership, and adherence to best practices in software development and infrastructure management.
  • DevOps and Automation: Drive the implementation of DevOps practices and automation to streamline infrastructure deployment, management, and monitoring, including CI/CD pipelines, infrastructure-as-code tools (Terraform, Ansible), configuration management, and monitoring/observability solutions.
  • Performance Optimization: Lead efforts to optimize AI workload performance, focusing on factors such as latency, throughput, resource utilization, and scalability, and implement robust monitoring and observability solutions to ensure system health and performance.
  • Cost Efficiency: Optimize infrastructure costs through efficient resource allocation, capacity planning, and the use of cost-effective technologies and cloud services, including cloud cost management strategies and bare metal optimization.
  • Security and Compliance: Ensure the security and compliance of the AI infrastructure, adhering to industry best practices, security protocols, and relevant regulations (e.g., data privacy, security certifications).
  • Collaboration: Collaborate closely with product, research, and operations teams to ensure seamless integration of AI infrastructure with GMI Cloud's offerings and customer needs, and to provide technical guidance and support.
  • Budget Management: Manage the budget for cloud infrastructure engineering, ensuring cost-effective resource allocation, forecasting infrastructure needs, and reporting on infrastructure spending.



Qualifications:



  • Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree preferred.
  • 15+ years of experience in designing, building, and managing complex infrastructure systems, with a strong and demonstrable focus on cloud computing, high-performance computing, and AI/ML workloads, including significant experience with containerized and orchestrated environments and InfiniBand-based networks.
  • Proven experience in leading and managing engineering teams, with a strong emphasis on technical leadership, mentorship, talent development, and performance management.
  • Deep understanding of cloud computing principles, distributed systems, and networking technologies, including cloud service models (IaaS, CaaS, PaaS), virtualization, network architecture, and software-defined networking.
  • Extensive knowledge of AI hardware and software, including GPUs, accelerators, machine learning frameworks (TensorFlow, PyTorch), distributed training paradigms, and model serving technologies, and a strong understanding of the AI ecosystem.
  • Strong experience with DevOps practices, automation tools, and infrastructure-as-code (Terraform, Ansible), and experience building and managing CI/CD pipelines, configuration management systems, and monitoring/observability tools.
  • Excellent communication, collaboration, and problem-solving skills, with the ability to effectively communicate complex technical concepts to both technical and non-technical audiences and to influence stakeholders.
  • Demonstrated ability to drive innovation, deliver results in a fast-paced environment, and adapt to evolving technologies and industry trends.
