Point72 is a leading global alternative investment firm focused on delivering superior returns through innovative strategies. The firm is seeking a Machine Learning Infrastructure Engineer to design and implement high-performance infrastructure for generative AI and machine learning workloads, collaborating with researchers and engineers to optimize models and workflows.
Responsibilities:
- Design and implement high-performance infrastructure to support large-scale generative AI and machine learning workloads, enabling faster model iteration and real business impact
- Design and operate distributed systems for model training, hyperparameter tuning, inference, and data preprocessing pipelines to deliver reliable end-to-end machine learning (ML) workflows
- Collaborate with ML researchers and engineers to productionize models, improving compute utilization and training throughput and reducing inference latency
- Develop and automate deployment, orchestration, and CI/CD pipelines for models and data workflows using container orchestration and infrastructure-as-code (IaC)
- Implement observability, monitoring, and cost-management strategies for GPU and accelerator compute environments to maintain predictable performance and spend
- Evaluate, integrate, and benchmark emerging hardware and software technologies across cloud and on-prem environments to improve scalability and throughput
- Drive security, compliance, and operational runbooks for GenAI infrastructure, including access controls, secrets management, and incident response procedures
- Troubleshoot, profile, and optimize performance across GPU and CPU compute stacks to remove bottlenecks and increase reliability
- Document architecture and operational practices, and mentor engineers to expand team capability and accelerate adoption of production-ready GenAI infrastructure
Requirements:
- Bachelor's or master's degree in computer science, electrical engineering, or a related technical field
- 3–7 years of experience building and maintaining scalable compute or machine learning infrastructure systems
- Deep understanding of distributed systems, container orchestration (Kubernetes), and public cloud platforms such as AWS, Google Cloud Platform, or Azure
- Hands-on experience with machine learning operations and infrastructure tools such as MLflow, Ray, Airflow, Kubeflow, and Terraform
- Strong understanding of reinforcement learning concepts and their infrastructure implications
- Proficiency in Python and systems-level programming in one or more languages such as Go, C++, or Rust
- Strong debugging, performance profiling, and optimization skills across GPU and CPU compute stacks
- Experience implementing monitoring, observability, and cost-optimization for GPU/accelerator-based compute environments
- Excellent collaboration and communication skills with a systems-thinking mindset
- Commitment to the highest ethical standards