Acceler8 Talent

ML Infrastructure

Boston, MA, US

4 days ago

Summary

Join Us as a Training Infrastructure Engineer (ML Systems & Foundation Models)


We're building the next wave of general-purpose AI—efficient, scalable, and designed to integrate seamlessly across enterprises. Our foundation model platform enables users to build, deploy, and optimize AI systems with precision and control.


As part of the technical core, you’ll design and optimize the distributed infrastructure powering everything from lean, task-specific models to large-scale multimodal systems. You’ll work on critical systems that make large-scale training faster, more reliable, and cost-efficient—at the very frontier of AI and infrastructure.


What You’ll Work On

You’ll build the distributed training infrastructure that makes high-throughput, multi-node, multi-GPU training not just possible but efficient, fault-tolerant, and scalable. From high-performance data pipelines to cutting-edge sharding techniques, your work will directly affect how fast and how far our models can scale.


Key Challenges You’ll Take On

  • Scaling training infrastructure across diverse hardware clusters and networking topologies
  • Designing robust checkpointing systems for complex model states
  • Optimizing communication patterns for parallelism strategies (tensor, pipeline, data)
  • Building efficient, multimodal data loaders that eliminate I/O bottlenecks
  • Collaborating with ML teams to scale novel training algorithms and architectures


You're a Great Fit If You Have:

  • Deep experience with large-scale distributed training frameworks (e.g., PyTorch Distributed, DeepSpeed, Megatron-LM)
  • A strong grasp of hardware accelerators (GPUs, TPUs) and distributed system design
  • Hands-on expertise in optimizing performance across compute, memory, and network boundaries
  • Experience working with diverse modalities: text, images, video, audio
  • A passion for debugging hard systems problems and pushing infrastructure to its limits


If you're passionate about distributed systems, large-scale training, and creating infrastructure that drives real AI breakthroughs, this is your chance to build it from the ground up.
