Acceler8 Talent

ML Infrastructure

Boston, MA, US

4 days ago

Summary

Join Us as a Training Infrastructure Engineer (ML Systems & Foundation Models)

We're building the next wave of general-purpose AI—efficient, scalable, and designed to integrate seamlessly across enterprises. Our foundation model platform enables users to build, deploy, and optimize AI systems with precision and control.

As part of the technical core, you’ll design and optimize the distributed infrastructure powering everything from lean, task-specific models to large-scale multimodal systems. You’ll work on critical systems that make large-scale training faster, more reliable, and cost-efficient—at the very frontier of AI and infrastructure.

What You’ll Work On

You’ll build the distributed training infrastructure that makes high-throughput, multi-node, multi-GPU training not just possible but efficient, fault-tolerant, and scalable. From high-performance data pipelines to cutting-edge sharding techniques, your work will directly affect how quickly our models train and how far they can scale.

Key Challenges You’ll Take On

Scaling training infrastructure across diverse hardware clusters and networking topologies
Designing robust checkpointing systems for complex model states
Optimizing communication patterns for parallelism strategies (tensor, pipeline, data)
Building efficient multimodal data loaders that eliminate I/O bottlenecks
Collaborating with ML teams to scale novel training algorithms and architectures

You're a Great Fit If You Have:

Deep experience with large-scale distributed training frameworks (e.g., PyTorch Distributed, DeepSpeed, Megatron-LM)
A strong grasp of hardware accelerators (GPUs, TPUs) and distributed system design
Hands-on expertise optimizing performance across compute, memory, and network boundaries
Experience working with diverse modalities: text, images, video, audio
A passion for debugging hard systems problems and pushing infrastructure to its limits

If you're passionate about distributed systems, large-scale training, and creating infrastructure that drives real AI breakthroughs, this is your chance to build it from the ground up.
