Overstory is tackling the climate crisis with innovative technology that makes the electrical grid more resilient. The Staff Machine Learning Ops Engineer will design and build our machine learning operations, keep our models reliable and maintainable, and collaborate with engineering and product teams to optimize technical solutions.
Responsibilities:
- Design and build the foundations of our machine learning operations, ensuring our models are reliable, maintainable, and deliver real value to customers
- Help architect end-to-end systems for experiment tracking, data management, and scalable deployment
- Design, build, and maintain processes and systems such as:
  - Automated pipelines for training, testing, and deploying ML models
  - Experiment tracking systems for performance metrics, data and model versioning, and documentation
  - Processes and systems for the full model lifecycle, including registries, release and rollback strategies, and scalable model serving
  - Monitoring and alerting for prediction quality, system health, and cost optimization
- Influence the direction of data and ML within Overstory by advocating for a balance between MLOps best practices and quick slices of value
- Align technical solutions with customer needs by collaborating with both engineering and product
- Ensure our MLOps systems support regulatory, privacy, and security requirements
Requirements:
- 10+ years of experience designing and building production-grade ML pipelines and systems, but don't filter yourself out if you feel you're a strong candidate with 5+ years
- Strong knowledge of experiment tracking, model deployment strategies, data versioning, and monitoring
- Experience with ML infrastructure tools (e.g. MLflow, Kubeflow, Airflow, feature stores, model registries)
- Strong communication skills and ability to align technical solutions with business goals
- Comfortable making architectural decisions and balancing best practices with practical trade-offs
- Familiarity with GCP and Vertex AI preferred, but not required
- Experience in remote-first or globally distributed teams
- Background in image processing, geospatial, or spatio-temporal data processing
- Prior work on real-time prediction systems or active-learning loops
- Knowledge of regulatory, privacy, or security considerations in ML
- Experience optimizing cloud infrastructure costs for ML workloads
- Familiarity with Overstory's mission domains (e.g. satellite imagery, forestry, utilities)