Upcoming Top AI Startup

Machine Learning Engineer

5 days ago

Responsibilities

• Design, deploy, and maintain large distributed ML training and inference clusters
• Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
• Research and test various training approaches, including parallelization techniques and numerical precision trade-offs across different model scales
• Analyze, profile, and debug low-level GPU operations to optimize performance
• Stay up to date on research and bring new ideas into our work
• Work across the full ML stack (data, model, eval, and infrastructure)
• Implement novel model architectures and training algorithms
• Build data pipelines and training infrastructure for massive, petabyte-scale, multimodal datasets
• Rapidly iterate on experiments and ablations

What we’re looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

• Strong grasp of machine learning fundamentals, and depth in at least one core domain (e.g., Computer Vision, Sensor Fusion, Language Models, Physics-informed NNs)
• Experience training models and interpreting experiment results through careful analysis and ablation studies
• Experience writing and optimizing massive, petabyte-scale data pipelines
• Familiarity with distributed training
• [bonus] Familiarity with meteorology, computational fluid dynamics, and/or numerical simulations

• Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
• Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
• Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
• Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
• Background working on distributed task management systems and scalable model serving & deployment architectures
• Understanding of monitoring, logging, observability, and version control best practices for ML systems