The Judge Group is seeking an experienced Machine Learning Engineer to build advanced algorithmic and AI capabilities. This role combines deep technical modeling expertise with infrastructure engineering to develop and operate end-to-end ML/AI systems at scale, working closely with Data Science, Data Engineering, and Architecture teams.
Responsibilities:
- Design and optimize machine learning models including deep learning architectures, LLMs, and BERT-based classifiers
- Build distributed training workflows using PyTorch and similar frameworks
- Fine‑tune large language models and optimize inference performance (Neuron Compiler, ONNX, vLLM)
- Optimize models for GPU, TPU, and AWS Inferentia/Trainium
- Design AI services for both real‑time and batch processing use cases
- Lead development of ML infrastructure covering data ingestion, feature engineering, training, and serving
- Build scalable inference systems for real-time and batch predictions
- Deploy models across EC2, EKS, SageMaker, and specialized inference hardware
- Implement and maintain core MLOps capabilities, including feature stores, observability, governance, and automated pipelines
- Build Infrastructure‑as‑Code workflows for training, evaluation, and deployment
- Develop MLOps tooling to simplify workflows for data science teams
- Create CI/CD pipelines for ML models and infrastructure components
- Monitor and optimize ML systems for performance, accuracy, latency, and cost efficiency
- Implement system profiling and observability across the ML lifecycle
- Partner with Data Engineering to ensure reliable, high-quality data availability for ML workloads
- Collaborate with Architecture, Governance, and Security teams to meet enterprise standards
- Provide technical guidance on modeling methods and AI infrastructure best practices
Requirements:
- Experienced Machine Learning Engineer (contract role) with a track record of building advanced algorithmic and AI capabilities across Personalization, Generative AI, Forecasting, and Decision Science
- Deep modeling expertise combined with the infrastructure engineering skills to develop and operate end-to-end ML/AI systems at scale
- Experience designing and optimizing machine learning models, including deep learning architectures, LLMs, and BERT-based classifiers
- Hands-on experience building distributed training workflows with PyTorch or similar frameworks
- Experience fine-tuning large language models and optimizing inference performance (Neuron Compiler, ONNX, vLLM)
- Experience optimizing models for GPU, TPU, and AWS Inferentia/Trainium, and designing AI services for both real-time and batch processing use cases
- Demonstrated ability to lead development of ML infrastructure spanning data ingestion, feature engineering, training, and serving
- Experience building scalable inference systems and deploying models across EC2, EKS, SageMaker, and specialized inference hardware
- Experience implementing core MLOps capabilities (feature stores, observability, governance, automated pipelines) and Infrastructure-as-Code workflows for training, evaluation, and deployment
- Experience creating CI/CD pipelines for ML models and infrastructure, and building MLOps tooling that simplifies workflows for data science teams
- Skilled at monitoring, profiling, and optimizing ML systems for performance, accuracy, latency, and cost efficiency across the ML lifecycle
- Strong collaboration skills: partnering with Data Engineering on data availability and quality, and working with Architecture, Governance, and Security teams to meet enterprise standards
- Ability to provide technical guidance on modeling methods and AI infrastructure best practices