We are seeking an experienced DevOps Engineer to join our infrastructure team, with a strong focus on managing and optimizing GPU-based compute environments for machine learning and deep learning workloads. In this role, you will be responsible for the end-to-end infrastructure lifecycle, from provisioning with Terraform/Ansible to deploying ML models using modern frameworks like Hugging Face.

Responsibilities:
Manage infrastructure using Terraform and Ansible
Deploy and monitor Kubernetes clusters with GPU support (including NVIDIA drivers and H100 SXM integration)
Implement and manage inferencing frameworks such as Ollama, Hugging Face, etc.
Support containerization (Docker), logging (EFK), and monitoring (Prometheus/Grafana)
Handle GPU resource scheduling, isolation, and scaling for ML/DL workloads
Collaborate closely with developers, data scientists, and ML engineers to streamline deployments

Skill Set:
5-8 years of hands-on experience in DevOps and infrastructure automation
Proven experience in managing GPU-based compute environments
Strong understanding of Docker, Kubernetes, and Linux internals
Familiarity with GPU server hardware and instance types
Proficient in scripting with Python and Bash
Good understanding of ML model deployment, inferencing workflows, and resource optimization

Nice to Have:
Experience with AI/ML pipelines
Knowledge of cloud-native technologies (AWS/GCP/Azure) supporting GPU workloads
Exposure to model performance benchmarking and A/B testing
(ref:hirist.tech)