DataRobot is a company that delivers AI solutions to maximize impact and minimize business risk. As a Staff Software Engineer in the AI Compute team, you will lead technical initiatives, mentor engineers, and ensure the development of secure and efficient systems for AI products.
Responsibilities:
- Build a system that ensures micro-services are secure, performant, reliable, and can go from idea to production in an hour
- Build a system that continuously provides recommendations to right-size computing resources for Kubernetes to ensure efficient cloud spending for ourselves and our customers
- Design and architect automated quality platforms that take Enterprise-Grade releases from once a quarter to once a week, once a day, and ultimately once an hour without sacrificing performance, security, or reliability
- Work with Product, Legal and Security to ensure the continuous delivery processes you build are compliant and secure
- Work with the team to ensure pipelines have clear playbooks and can operate 24/7 without you
- Work with a diverse group of architects and platform engineers across our R&D department to set continuous delivery and performance requirements for all production services
- Work with internal product managers to set roadmaps and define milestones to deliver innovative and simple solutions to our many teams’ continuous delivery and platform engineering issues
Requirements:
- 8+ years of experience
- Expert in developing a wide variety of software with Python (5+ years)
- Experience designing and operating diverse CI/CD pipelines with Harness.io
- Experience designing and innovating large-scale, horizontally and vertically scaled build, test, and deployment systems for Kubernetes environments, plus familiarity with Helm charts
- Expert proficiency in Kubernetes architecture and operations, including resource management/scheduling, auto-scaling, Gateway API/Ingress, Prometheus, and OpenTelemetry, or experience with other orchestrators such as Nomad or Slurm
- Experience with GPU clusters as a user or administrator, or experience in multi-node AI/ML
- Passionate about developing products for your fellow internal developers
- Experience setting technical direction, making architectural decisions, and driving consensus across multiple teams and stakeholders
- Able to influence others in the organization even where you lack explicit authority
- Proven track record of leading large-scale projects to completion across dozens of teams/Pods
- Experience mentoring senior engineers, fostering a positive team culture, and promoting continuous learning and improvement among colleagues
- Operational excellence: continuously defining and improving SLAs (Service Level Agreements), working backward from customer experience, for all the software components this team manages
- Golang, Terraform and Terragrunt
- Chronosphere
- Multi-cloud experience (AWS, Azure, GCP, and OpenShift)