Domino Data Lab builds software that helps AI-driven organizations operate advanced data science solutions. The Staff Software Engineer will join the Model Development Lifecycle team, focusing on integrating model monitoring, enhancing tagging capabilities, and expanding LLM hosting capabilities to support AI model development and deployment.
Responsibilities:
- Integrate model monitoring to provide a holistic view of deployment health and performance
- Enhance tagging capabilities across Domino entities to improve discoverability and tracking
- Expand LLM hosting capabilities to address customer needs for scale, performance, and logging
- Innovate within our Domino Apps offering by incorporating feature requests from major customers
Requirements:
- Building Scalable Systems: Hands-on experience developing and managing high-performance back-end systems in distributed computing environments
- Collaboration Across Teams: Working closely with cross-functional teams to integrate systems with front-end interfaces and third-party services
- API Development: Designing and implementing secure, scalable APIs (e.g., RESTful APIs, gRPC)
- Performance Optimization: Profiling and optimizing back-end performance, especially in cloud environments or with container technologies like Docker and Kubernetes
- Testing and CI/CD: Using robust testing frameworks (unit, integration, end-to-end) and setting up CI/CD pipelines
- ML Model Deployment: Familiarity with model registries, versioning, and lifecycle management tools like MLflow or Kubeflow (a big plus)
- Distributed Computing: Experience with frameworks like Apache Spark, Azure ML, or SageMaker (a plus)
- Cloud Platforms: Proficiency with cloud providers (AWS, Azure, GCP) and deploying services in these environments
- Back-End Development: Expertise in languages such as Python, Java, Scala, or Go