Architect and maintain end-to-end machine learning training pipelines on AWS (SageMaker, EKS, Step Functions) to ensure reliable and reproducible model development and deployment
Build and maintain infrastructure for production agentic applications using Amazon Bedrock and Bedrock AgentCore — including agent runtimes, memory, secure gateways, and observability at scale
Contribute to the architectural evolution of our ML platform, including evaluating MLOps tooling and participating in buy vs. build decisions
Implement AI/ML governance best practices for model versioning, testing, validation, maintenance, and security
Integrate MLOps best practices with Expel's SDLC, security, and infrastructure standards, working alongside SRE, Platform Engineering, and Security teams
Drive quality, reliability, and scalability improvements through thoughtful engineering and monitoring
Partner with data scientists, software engineers, and stakeholders to operationalize ML models reliably and at scale
Mentor and support junior engineers; foster a culture of engineering excellence
Create and maintain documentation, internal tooling, and enablement resources so practitioners across Expel can work effectively with ML systems
Stay current with the MLOps landscape and bring relevant innovations back to the team
Requirements
5+ years of relevant software engineering experience with a meaningful focus on ML operations and infrastructure
Degree in Computer Science, Mathematics, Statistics, Engineering, or a related technical field preferred (or a compelling story)
Strong Python proficiency; familiarity with other languages (Go, JS) is a plus
Solid experience with CI/CD pipelines, infrastructure-as-code, and containerization for ML workloads
Hands-on experience with cloud-based ML platforms — AWS (SageMaker, Bedrock, Bedrock AgentCore) strongly preferred; GCP (Vertex AI) experience also valued
Proven experience operationalizing LLMs and building infrastructure for complex agentic applications — agent orchestration, memory, tool calling, RAG architectures
Familiarity with ML frameworks including Scikit-Learn, PyTorch, Spark, and TensorFlow
Working knowledge of continuous retraining, concept drift monitoring, and data drift detection in production
Tech Stack
AWS
Cloud
Google Cloud Platform
JavaScript
Python
PyTorch
Scikit-Learn
SDLC
Spark
TensorFlow
Go
Benefits
Offer unlimited PTO (that leadership models and encourages)
Offer up to 24 weeks of parental leave
Provide excellent health benefits
Pay you monthly fitness and cell phone stipends (no receipts required)
Support your professional growth with a conference benefit and continuous learning opportunities
Offer full remote flexibility — work from wherever you do your best work