TieTalent

Senior Machine Learning Operations Engineer

Boulder, CO, US

$141k-$168k/year
about 1 month ago

Job Summary

We are seeking an experienced Machine Learning Operations (MLOps) Engineer to join and help shape our new MLOps team. This role focuses on deploying and optimizing machine learning models for always-on, high-availability systems in real-world, real-time unclassified and classified environments. As part of a new and growing team, you will have the unique opportunity to evangelize MLOps practices, contribute to the development of an on-premises development platform, and drive innovation in mission-critical applications.

Responsibilities

Deploy and maintain high-performing ML models (e.g., ensembles of LSTMs and Random Forests) in real-time environments

Monitor deployed models for drift or performance degradation and implement automated retraining pipelines

Implement advanced deployment strategies (e.g., Blue-Green, Canary, Champion-Challenger)

Develop modular and flexible ML pipelines that ensure uptime and reliability

Build and manage scalable infrastructure using Kubernetes, Docker, Terraform, and related tools

Design and implement an on-premises development platform using Kubeflow to replicate cloud capabilities in classified environments

Set up robust monitoring, logging, and alerting systems using Prometheus, Grafana, and Loki

Optimize performance metrics like inference latency and system throughput while ensuring fault tolerance

Work with cross-functional teams, including Data Engineering, Machine Learning, and DevOps, to integrate and enhance ML systems

Define touchpoints and handoffs with DevOps and Data Engineering to ensure seamless integration of ML workflows with existing infrastructure and data pipelines

Mentor junior team members and contribute to building a collaborative and innovative team culture

Other duties as assigned

Requirements

8+ years of experience, including leading large-scale ML model deployments and scaling production environments

Expertise in architecting Python applications for large-scale systems, mentoring junior engineers in Python best practices, and optimizing code for high performance

Proven leadership in designing enterprise-grade CI/CD systems, incorporating advanced features like parallel testing, rollback strategies, and security hardening

Advanced expertise in designing and optimizing distributed pipelines with Protobufs and ZeroMQ, ensuring fault tolerance and scalability

Advanced expertise in designing workflows using MLflow or Kubeflow to streamline experimentation and production deployments

Expertise in architecting complex Kubernetes and Terraform configurations for distributed systems, incorporating advanced features like auto-scaling and load balancing

Preferred Qualifications

Familiarity with C++ and/or Rust

Experience with workflow orchestration tools such as Airflow or Prefect

Experience with distributed data processing frameworks such as PySpark

Familiarity with SQL and modern database and object storage technologies (e.g., Yugabyte, MinIO)

Experience with DVC, Ansible, Kustomize, Helm, Prometheus, and Grafana

Understanding of secure software development practices and/or experience working in classified environments

Education

Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related technical field

Relevant certifications (e.g., Certified Kubernetes Administrator, Certified Kubernetes Application Developer, Terraform Associate) are a plus

Soft Skills

Strong problem-solving and analytical skills

Excellent communication and collaboration capabilities

Ability to thrive in a dynamic, fast-paced environment

Good verbal and written communication skills

Detail-oriented

Benefits

SciTec offers a highly competitive salary and benefits package, including:

Employee Stock Ownership Plan (ESOP)

3% Fully Vested Company 401(k) Contribution (no employee contribution required)

100% company-paid HSA Medical insurance, with a choice of 2 buy-up options

80% company-paid Dental insurance

100% company-paid Vision insurance

100% company-paid Life insurance

100% company-paid Long-term Disability insurance

Short-term Disability insurance

Annual Profit-Sharing Plan

Discretionary Performance Bonus

Paid Parental Leave

Generous Paid Time Off, including Holiday, Vacation, and Sick Pay

Flexible Work Hours

The pay range for this position is $141,000 to $168,000 per year. SciTec considers several factors when extending an offer of employment, including but not limited to the role and associated responsibilities, a candidate's work experience, education/training, and key skills. This is not a guarantee of compensation.

SciTec is committed to hiring and retaining a diverse workforce and is proud to be an Equal Opportunity/Affirmative Action employer. M/F/VETS/Disabled

Nice-to-have skills

  • Machine Learning
  • Kubernetes
  • Docker
  • Terraform
  • Prometheus
  • Grafana
  • Python
  • ZeroMQ
  • C++
  • Rust
  • PySpark
  • SQL
  • Ansible

Work experience

  • Machine Learning
  • DevOps
  • Data Engineer

Languages

  • English
