Design, build, and evolve platform capabilities that support ML training, batch inference, and model deployment workflows at scale.
Own and improve core platform components (e.g., compute orchestration, data pipelines, inference systems) used by multiple teams across Blue River and John Deere.
Continuously enhance platform reliability, scalability, and performance, with a focus on real-world ML workloads.
Enable ML engineers to move faster by building intuitive, well-documented platform tools and workflows across the model lifecycle (experimentation, deployment, and iteration).
Improve model inference performance and throughput while balancing trade-offs among cost, latency, and reliability.
Support and scale distributed training and inference systems, including frameworks such as Ray and related tooling.
Develop and optimize hybrid compute environments (cloud + on-prem/GPU infrastructure) to support large-scale ML workloads.
Build and maintain infrastructure leveraging Kubernetes, Slurm, and cloud platforms (AWS preferred).
Identify and resolve bottlenecks in compute, storage, and data movement pipelines.
Evaluate existing platform systems and make thoughtful decisions on when to extend, refactor, or rebuild components.
Drive improvements in system architecture, balancing short-term delivery with long-term platform health.
Contribute to shaping the platform roadmap and technical direction in response to evolving business and ML needs.
Partner closely with ML engineers, robotics teams, infrastructure teams, and product stakeholders to translate requirements into scalable platform solutions.
Act as a technical bridge between teams, ensuring platform capabilities align with real-world use cases and constraints.
Influence platform adoption and best practices across multiple teams.
Support platform capabilities that enable simulation-based testing and validation of ML systems, including synthetic data workflows.
Improve tooling that allows teams to test and validate models before production deployment.
Provide technical guidance and mentorship to junior engineers on platform and systems design.
Lead implementation efforts for key platform initiatives and ensure high-quality execution.
Demonstrate strong ownership and accountability for delivering impactful platform improvements.
Requirements
5+ years of professional engineering experience, with a focus on platform, infrastructure, or systems engineering.
Strong technical judgment, balancing the evolution of legacy platforms with the design and delivery of new, greenfield components shared across multiple teams and workloads.
Excellent Python skills, demonstrated in production systems, tooling, and platform components.
Solid understanding of ML systems and the end-to-end model development lifecycle, from experimentation to deployment and iteration.
Hands-on experience or strong familiarity with cloud platforms (AWS preferred), container orchestration systems such as Kubernetes, and workload managers such as Slurm.
Ability to partner effectively with ML engineers, infra teams, and product stakeholders to translate requirements into platform capabilities.
Ability to quickly ramp up on new domains, tools, and complex existing systems.
Tech Stack
AWS
Cloud
Kubernetes
Python
Ray
Benefits
Visa sponsorship will be considered on a case-by-case basis.