Meta Platforms, Inc.

SiteOps Global Production Platform Engineering Manager

Garland, TX, US

Onsite
Full-time
2 months ago

Summary

Meta is seeking a forward-thinking, experienced Production Platform Engineering Manager to join the Data Center Site Operations team. The Production Platform Engineering (PPE) team is responsible for the overall performance of Meta's production compute, storage, and accelerator (GPU) platforms throughout their life cycles in our data centers. This role will lead a subset of the overall PPE team. The role's scope is maintaining and improving the health of platforms from operational testing into mass production through end-of-life. Key responsibilities include identifying systemic hardware, firmware, and tooling issues; engaging in hands-on problem solving; and collaborating effectively with cross-functional engineering and tooling teams to improve the performance of the fleet. Our data centers, and the tens of thousands of servers installed in them, are the foundation upon which our rapidly scaling infrastructure efficiently operates and upon which our services are delivered. Meta is at the leading edge of the global data center industry in terms of both how data centers are designed and how they are operated. This person should enjoy working in an environment where adaptability and flexibility will be key to their success.
We seek an individual who can quickly absorb and understand the technical challenges of subject matter experts and local site operations teams, create alignment between these globally distributed teams as well as partner organizations, and set informed priorities and direction while getting buy-in and commitment from relevant stakeholders.

Qualifications

BS or BA in a technical field (electrical, computer science, or mechanical engineering) or comparable experience
10+ years of experience in NPI (New Product Introduction) hardware development and/or validation, working with cross-functional teams to deliver products to production
Experience working across a global organization and building partnerships with cross-functional teams inside and outside of the organization
Experience troubleshooting and debugging hardware platforms
Experience in processing and analyzing large sets of data
Knowledge of server and storage platforms, principles, technologies, protocols, and standards
Experience with GPU- and accelerator-based platform hardware that operates in computing clusters
Experience managing multiple concurrent projects and managing tight timelines
Experience working within an interdisciplinary team of hardware and operations engineers
Experience working with Linux or UNIX operating systems
Technical skills creating documentation for users of all levels
Experience mentoring others and leading technical teams
Direct experience managing others
Large-scale data center environment experience, including hardware deployments, system knowledge of Linux, server hardware, networking, network protocols, supply chain, and data center automation
Bash, PHP, Python, or Perl scripting experience
Experience in data center system and process automation
Leadership presentation skills
