Upcoming Top AI Startup

Machine Learning Engineer

San Francisco, CA, US

5 days ago

Summary

Responsibilities

  • Design, deploy, and maintain large distributed ML training and inference clusters
  • Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
  • Research and test various training approaches including parallelization techniques and numerical precision trade-offs across different model scales
  • Analyze, profile and debug low-level GPU operations to optimize performance
  • Stay up-to-date on research to bring new ideas to work
  • Work across the full ML stack (data, model, eval, and infrastructure)
  • Implement novel model architectures and training algorithms
  • Build data pipelines and training infrastructure for massive, petabyte-scale, multimodal datasets
  • Rapidly iterate on experiments and ablations


What we’re looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

  • Strong grasp of machine learning fundamentals, and depth in at least one core domain (e.g. Computer Vision, Sensor Fusion, Language Models, Physics-informed NNs)
  • Experience training models and interpreting experiment results through careful analysis and ablation studies.
  • Experience writing and optimizing petabyte-scale data pipelines.
  • Familiarity with distributed training.
  • [bonus] Familiarity with meteorology, computational fluid dynamics, and/or numerical simulations.

  • Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
  • Demonstrated proficiency with distributed training frameworks (e.g. FSDP, DeepSpeed) to train large foundation models
  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
  • Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
  • Background working on distributed task management systems and scalable model serving & deployment architectures
  • Understanding of monitoring, logging, observability, and version control best practices for ML systems
