Point72 is a leading global alternative investment firm focused on delivering superior returns through innovative strategies. The firm is seeking a Machine Learning Infrastructure Engineer to design and implement high-performance infrastructure for generative AI and machine learning workloads, collaborating with researchers and engineers to optimize models and workflows.
Responsibilities:
- Design and implement high-performance infrastructure to support large-scale generative AI and machine learning workloads, enabling faster model iteration and real business impact
- Design and operate distributed systems for model training, hyperparameter tuning, inference, and data preprocessing pipelines to deliver reliable end-to-end machine learning (ML) workflows
- Collaborate with ML researchers and engineers to productionize models, improving compute utilization and training throughput and reducing inference latency
- Develop and automate deployment, orchestration, and CI/CD pipelines for models and data workflows using container orchestration and infrastructure-as-code (IaC)
- Implement observability, monitoring, and cost-management strategies for GPU and accelerator compute environments to maintain predictable performance and spend
- Evaluate, integrate, and benchmark emerging hardware and software technologies across cloud and on-prem environments to improve scalability and throughput
- Drive security, compliance, and operational runbooks for GenAI infrastructure, including access controls, secrets management, and incident response procedures
- Troubleshoot, profile, and optimize performance across GPU and CPU compute stacks to remove bottlenecks and increase reliability
- Document architecture and operational practices, and mentor engineers to expand team capability and accelerate adoption of production-ready GenAI infrastructure
Requirements:
- Bachelor's or master's degree in computer science, electrical engineering, or a related technical field
- 3–7 years of experience building and maintaining scalable compute or machine learning infrastructure systems
- Deep understanding of distributed systems, container orchestration (Kubernetes), and public cloud platforms such as AWS, Google Cloud Platform, or Azure
- Hands-on experience with machine learning operations and infrastructure tools such as MLflow, Ray, Airflow, Kubeflow, and Terraform
- Strong understanding of reinforcement learning concepts and their infrastructure implications
- Proficiency in Python and systems-level programming in one or more languages such as Go, C++, or Rust
- Strong debugging, performance profiling, and optimization skills across GPU and CPU compute stacks
- Experience implementing monitoring, observability, and cost-optimization for GPU/accelerator-based compute environments
- Excellent collaboration and communication skills with a systems-thinking mindset
- Commitment to the highest ethical standards