Software Engineer, AI Systems Performance Modeling

Palo Alto, CA, US

Onsite
Full-time

Summary

Join Tesla's Dojo Performance Team to design and optimize cutting-edge system-level simulation frameworks for AI accelerators. You will simulate the performance of thousands of Dojo compute nodes operating in parallel to drive state-of-the-art machine learning (ML) workloads. This role centers on modeling large-scale AI training systems to evaluate the performance of new kernels and mapping strategies. By analyzing trade-offs between memory, compute, and communication across system resources, you will help push the boundaries of AI performance and efficiency.

Responsibilities

* Develop system-level simulation frameworks to model the performance of massively parallel AI accelerators, including compute distribution, memory hierarchy, interconnects, and dataflow
* Simulate and analyze how large-scale ML workloads, from FSD to LLMs, are mapped and executed across thousands of Dojo compute nodes
* Collaborate with ML architects, kernel developers, and system engineers to ensure simulations reflect real-world AI training requirements
* Design and implement tests to evaluate trade-offs in system resources, including memory bandwidth, capacity, latency, and compute, to optimize performance for large-scale AI workloads
* Build and maintain software tools and frameworks to support simulation development, testing, and integration
* Conduct performance analysis to identify bottlenecks and propose system-level optimizations
* Stay current with advancements in ML model architectures, parallel computing, and system-level simulation techniques
* Participate in code reviews, debugging, and testing to ensure robust and scalable simulation frameworks

Requirements

* Degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience and demonstrated exceptional skill
* Strong proficiency in C++ for developing high-performance simulation frameworks
* Solid understanding of ML/deep learning model architectures, including how models are partitioned and mapped across multiple devices
* Solid understanding of compute architecture, memory hierarchies, and dataflows
* Experience in system-level simulation, parallel computing, or ML workload optimization
* Knowledge of kernel development processes and how ML workloads are deployed on hardware accelerators
* Familiarity with analytical simulation techniques for modeling high-level system behavior
* Excellent problem-solving skills, with the ability to analyze complex systems and propose innovative solutions
* Strong communication and collaboration skills to work effectively with cross-functional teams, including ML researchers, architects, and engineers
* Ability to work onsite in our Palo Alto, CA office
