Bright Machines is an innovative company focused on transforming the manufacturing industry through intelligent automation. The Senior Platform/MLOps Engineer will be responsible for building scalable systems for AI/ML infrastructure, including training pipelines and model deployments, while collaborating with various teams to enhance manufacturing operations.
Responsibilities:
- Design, implement, and maintain reliable, scalable, and secure infrastructure, applications, and tooling, with a focus on our ML/AI pipelines and workloads
- Write clean, maintainable code and perform peer code reviews
- Write clear and concise documentation and engage in cross-team communication and knowledge sharing
- Work with other team members to investigate design approaches, prototype new technology and evaluate technical feasibility
- Pair with adjacent teams to understand how your frameworks and infrastructure are actually used in the field, continuously improving them and leveraging recent advances to improve developer velocity
Requirements:
- At least 5 years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
- B.S. or M.S. degree (or equivalent) in Computer Science, Engineering, or a related field
- Proficiency in at least one modern programming language (Python, JavaScript, C#, Go, etc.)
- Demonstrated application of industry best practices in MLOps
- Proficiency with CI/CD tools and GitOps workflows
- Familiarity with running GPU workloads in Kubernetes
- Strong knowledge of Kubernetes (self-hosted and managed) and modern Kubernetes paradigms (e.g., CNCF)
- Proficiency with Infrastructure as Code tools (Terraform, etc.) and configuration management tools (Ansible, etc.)
- Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry)
- Willingness to travel 25% of the time
- Experience in air-gapped or extremely strict security environments
- Experience communicating with users, technical leaders, and management to collect requirements, describe system designs, and architect software systems that meet your stakeholders' needs
- Knowledge and demonstrated application of software engineering best practices relating to the SDLC, including code reviews, SCM, CI/CD, testing, and operations
- Demonstrated ability to mentor and grow other team members