Krutrim

Platform System Engineer (AI Labs)

Palo Alto, CA, US

about 2 months ago

Job Title: AI Cloud Platform System Engineer

Location: US-San Francisco Bay Area

Position Type: Full-Time


Job Summary

We seek an AI Cloud Platform System Engineer to build, scale, and optimize our LLM training, inference, and data platforms. This role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and data platforms for training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.


Key Responsibilities

Distributed Training/Inference Platform Development

  • Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
  • Optimize workload distribution across GPU clusters (e.g., model parallelism, mixed-precision training) for performance and cost.
  • Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
  • Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.


Platform & System Optimization

  • Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight.
  • Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
  • Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
  • GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.


Kubernetes-Centric Development

  • Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads.
  • Build operators for intelligent resource scheduling, auto-scaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
  • Build observability tools for GPU utilization, model latency, and system health.
  • Leverage tools like Kubeflow, KServe, KubeRay, or SkyPilot for workflow orchestration.


Preferred Qualifications

Technical Skills

  • 2+ years of experience in ML infrastructure (LLM training/inference platforms preferred).
  • Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
  • Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
  • LLM-specific optimizations (e.g., MoE architectures, speculative decoding).
  • Networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).


Education & Soft Skills

  • MS/PhD in Computer Science, AI/ML, or equivalent hands-on experience.
  • Strong collaboration skills to interface with research and engineering teams.
  • Problem-solving agility to balance performance, cost, and scalability.
