Software Engineer, AI Systems Performance Modeling

Palo Alto, CA, US

Onsite
Full-time

Summary

Join Tesla's Dojo Performance Team to design and optimize cutting-edge system-level simulation frameworks for AI accelerators. You will simulate the performance of thousands of Dojo compute nodes operating in parallel to drive state-of-the-art machine learning (ML) workloads. This role centers on modeling large-scale AI training systems to evaluate the performance of new kernels and mapping strategies. By analyzing trade-offs between memory, compute, and communication across system resources, you will help push the boundaries of AI performance and efficiency.

Responsibilities

* Develop system-level simulation frameworks to model the performance of massively parallel AI accelerators, including compute distribution, memory hierarchy, interconnects, and dataflow
* Simulate and analyze how large-scale ML workloads, from FSD to LLMs, are mapped and executed across thousands of Dojo compute nodes
* Collaborate with ML architects, kernel developers, and system engineers to ensure simulations reflect real-world AI training requirements
* Design and implement tests to evaluate trade-offs in system resources, including memory bandwidth, capacity, latency, and compute, to optimize performance for large-scale AI workloads
* Build and maintain software tools and frameworks to support simulation development, testing, and integration
* Conduct performance analysis to identify bottlenecks and propose system-level optimizations
* Stay current with advancements in ML model architectures, parallel computing, and system-level simulation techniques
* Participate in code reviews, debugging, and testing to ensure robust and scalable simulation frameworks

Requirements

* Degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience and demonstrated exceptional skill
* Strong proficiency in C++ for developing high-performance simulation frameworks
* Solid understanding of ML/deep learning model architectures, including how models are partitioned and mapped across multiple devices
* Solid understanding of compute architecture, memory hierarchies, and dataflows
* Experience in system-level simulation, parallel computing, or ML workload optimization
* Knowledge of kernel development processes and how ML workloads are deployed on hardware accelerators
* Familiarity with analytical simulation techniques for modeling high-level system behavior
* Excellent problem-solving skills, with the ability to analyze complex systems and propose innovative solutions
* Strong communication and collaboration skills to work effectively with cross-functional teams, including ML researchers, architects, and engineers
* Ability to work onsite in our Palo Alto, CA office
