Calix provides cloud and software solutions for communications service providers. The company is seeking a highly skilled Staff AI Ops Engineer to build and maintain infrastructure for machine learning (ML) and generative AI (GenAI) applications, collaborating closely with data scientists and software developers to keep those systems robust and efficient.
Responsibilities:
- Design, implement, and maintain scalable infrastructure for ML and GenAI applications
- Deploy, operate, and troubleshoot production ML/GenAI pipelines/services
- Build and optimize CI/CD pipelines for ML model deployment and serving
- Scale compute resources across CPU/GPU architectures to meet performance requirements
- Implement container orchestration with Kubernetes
- Architect and optimize cloud resources on GCP for ML training and inference
- Set up and maintain runtime frameworks and job management systems (Airflow, Kubeflow, MLflow, etc.)
- Establish monitoring, logging, and alerting for system observability
- Optimize system performance and resource utilization for cost efficiency
- Develop and enforce AIOps best practices across the organization
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience)
- 8+ years of overall software engineering experience
- 3+ years of focused experience in DevOps/AIOps or similar ML infrastructure roles
- Proficiency in infrastructure as code (IaC) using Terraform
- Strong experience with containerization and orchestration using Docker and Kubernetes
- Demonstrated expertise in cloud infrastructure management on GCP
- Proficiency with workflow management tools such as Airflow and Kubeflow
- Strong CI/CD expertise with experience implementing automated testing and deployment pipelines
- Experience scaling distributed compute architectures across CPUs and accelerators such as GPUs
- Solid understanding of system performance optimization techniques
- Experience implementing comprehensive observability solutions for complex systems
- Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack)
- Strong proficiency in Python
- Familiarity with ML frameworks such as PyTorch and ML platforms like Vertex AI
- Excellent problem-solving skills and ability to work independently
- Strong communication skills and ability to work effectively in cross-functional teams