Staff AI Ops Engineer at Calix | JobVerse
Calix
Remote
Staff AI Ops Engineer
United States
Full Time
1 day ago
$136,000 - $265,700 USD
Visa Sponsor
Key skills
Airflow
Cloud
Docker
Google Cloud Platform
Grafana
Kubernetes
Prometheus
Python
PyTorch
Terraform
AI
ML
GenAI
MLflow
Kubeflow
GCP
Google Cloud
Vertex AI
ELK Stack
Performance Optimization
CI/CD
Communication
About this role
Role Overview
Design, implement, and maintain scalable infrastructure for ML and GenAI applications
Deploy, operate, and troubleshoot production ML/GenAI pipelines and services
Build and optimize CI/CD pipelines for ML model deployment and serving
Scale compute resources across CPU/GPU architectures to meet performance requirements
Implement container orchestration with Kubernetes
Architect and optimize cloud resources on GCP for ML training and inference
Set up and maintain runtime frameworks and job management systems (Airflow, Kubeflow, MLflow, etc.)
Establish monitoring, logging, and alerting for system observability
Optimize system performance and resource utilization for cost efficiency
Develop and enforce AIOps best practices across the organization
Requirements
Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience)
8+ years of overall software engineering experience
3+ years of focused experience in DevOps/AIOps or similar ML infrastructure roles
Proficiency in infrastructure as code (IaC) using Terraform
Strong experience with containerization and orchestration using Docker and Kubernetes
Demonstrated expertise in cloud infrastructure management on GCP
Proficiency with workflow management tools such as Airflow and Kubeflow
Strong CI/CD expertise with experience implementing automated testing and deployment pipelines
Experience scaling distributed compute architectures across CPU and GPU hardware
Solid understanding of system performance optimization techniques
Experience implementing comprehensive observability solutions for complex systems
Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack)
Strong proficiency in Python
Familiarity with ML frameworks such as PyTorch and ML platforms like Vertex AI
Excellent problem-solving skills and ability to work independently
Strong communication skills and ability to work effectively in cross-functional teams
Tech Stack
Airflow
Cloud
Docker
Google Cloud Platform
Grafana
Kubernetes
Prometheus
Python
PyTorch
Terraform
Benefits
Health insurance
401(k) matching
Flexible work arrangements
Professional development
Possible bonuses