Inflection AI

Senior Software Engineer (ML Training Infrastructure)

Palo Alto, CA, US


Summary



Inflection AI is a public benefit corporation leveraging our world-class large language model to build the first AI platform focused on the needs of the enterprise.


Who we are:

Inflection AI was re-founded in March 2024, and our leadership has assembled a team of kind, innovative, and collaborative individuals focused on building enterprise AI solutions. We are an organization that is passionate about what we are building, enjoys working together, and strives to hire people with diverse backgrounds and experience.


Our first product, Pi, is an empathetic, conversational chatbot. Pi is a public example of what we build from our 350B+ frontier model with our sophisticated fine-tuning (10M+ examples), inference, and orchestration platform. We are now focusing on building new systems that directly support the needs of enterprise customers using this same approach.


Want to work with us? Have questions? Learn more below.


About the Role


As a Senior Software Engineer on the ML Training Infrastructure team, you’ll design and operate the systems that power large-scale machine learning workflows—from model training through to production deployment. You'll develop control planes, manage distributed compute clusters, and build the tooling that ensures our platform remains reliable, secure, and highly scalable. We're looking for engineers with hands-on experience running ML infrastructure in production and a strong open-source mindset.


This role is a strong fit if you:

  • Have deep experience running production ML systems and building tooling to support them.
  • Are confident managing distributed systems using Kubernetes, SLURM, and Ray.
  • Have a strong open-source track record and are comfortable working with both community and internal tools.
  • Bring a security-aware mindset to infrastructure design, even if you’re not in a dedicated security role.
  • Enjoy working in fast-paced, technically ambitious environments focused on scaling ML systems.


In this role, you will:

  • Build scalable infrastructure for ML workflows, from training to production ops.
  • Design and operate control planes and tools to ensure secure, efficient service delivery.
  • Collaborate across teams to optimize system performance and resource utilization.
  • Orchestrate distributed ML workloads using frameworks like Kubernetes, SLURM, and Ray.
  • Evaluate and adopt emerging technologies to keep the platform at the cutting edge.
  • Apply security best practices to maintain strong operational posture across ML environments.


Keywords: ML infrastructure, Kubernetes, SLURM, Ray, distributed systems, model training, production ML, control planes, ML tooling, orchestration, infrastructure security, scalable systems, AI workflows, open-source infrastructure, machine learning deployment
