Responsibilities
* The AI Infrastructure Tools Engineer will be responsible for driving the tooling effort required to provide high uptime and service availability for AI Infrastructure at Cerebras.
* Architect, design, and develop frameworks and tools for monitoring, operating, and maintaining AI Infrastructure.
* Collaborate with Cluster Deployment, Network Operations, and Cluster Operations teams to understand their needs and ensure tools meet their requirements.
* Identify areas for improvement and implement new tools or technologies to enhance AI infrastructure efficiency, reliability, and security.
* Use AI to analyze data and identify trends, patterns, and anomalies.
* Develop user interfaces (UI), reports, analytics, visualizations, and dashboards consumable by engineers, leadership, and customers.
Skills And Qualifications
* Bachelor's degree or higher in Electrical Engineering, Computer Engineering, or Computer Science.
* 10+ years of experience designing and building infrastructure tools.
* Must be proficient in Python and/or Golang development.
* Expertise with cloud platforms like AWS, Azure, or Google Cloud.
* Experience managing tooling infrastructure as code, enabling repeatable and consistent deployments.
* Experience with containerization (e.g., Docker) and orchestration (e.g., Kubernetes).
* Expertise in setting up and maintaining monitoring and logging systems.
* Experience with network monitoring and analytics tools like Prometheus and Grafana, and familiarity with gNMI, OpenConfig, OpenTelemetry, or New Relic.
* Experience building dashboards and analytics. Understanding of UX/UI design techniques.
* Excellent problem-solving and analytical skills.
* Strong communication and collaboration skills.