Krutrim

Platform System Engineer (AI Labs)

Palo Alto, CA, US

about 2 months ago

Job Title: AI Cloud Platform System Engineer

Location: US-San Francisco Bay Area

Position Type: Full-Time


Job Summary

We seek an AI Cloud Platform System Engineer to build, scale, and optimize our LLM training, inference, and data platforms. This role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and data platforms for training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.


Key Responsibilities

Distributed Training/Inference Platform Development

  • Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
  • Optimize workload distribution across GPU clusters (e.g., model parallelism, mixed-precision training) for performance and cost.
  • Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
  • Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.


Platform & System Optimization

  • Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight.
  • Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
  • Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
  • GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.


Kubernetes-Centric Development

  • Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads.
  • Build operators for intelligent resource scheduling, auto-scaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
  • Build observability tools for GPU utilization, model latency, and system health.
  • Leverage tools like Kubeflow, KServe, KubeRay, or SkyPilot for workflow orchestration.


Preferred Qualifications

Technical Skills

  • 2+ years of experience in ML infrastructure (LLM training/inference platforms preferred).
  • Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
  • Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
  • LLM-specific optimizations (e.g., MoE architectures, speculative decoding).
  • Networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).


Education & Soft Skills

  • MS/PhD in Computer Science, AI/ML, or equivalent hands-on experience.
  • Strong collaboration skills to interface with research and engineering teams.
  • Problem-solving agility to balance performance, cost, and scalability.
