Lead the design, implementation, and maintenance of the core services of the AI compute control plane: job scheduling, cluster management, resource quota enforcement, and compute lifecycle management
Own the control plane services that manage GPU/CPU workload orchestration — from job submission through execution, monitoring, and teardown
Design reliable, fault-tolerant worker services and supervisor patterns for long-running compute workloads
Build and evolve the data layer that tracks job state, cluster state, and resource ownership across the platform
Partner closely with ASML engineers to deeply understand their workflows and translate requirements into robust platform capabilities
Develop and maintain Python SDKs and CLIs that ML engineers use to interact with the platform — prioritizing developer experience and reliability
Drive end-to-end ownership of features — from API design and data modelling through deployment and production operations
Establish observability standards (metrics, tracing, alerting) for scheduling and compute systems
Lead incident response and root cause analysis for production issues in compute orchestration
Mentor junior and mid-level engineers on system design, scheduling patterns, and platform engineering best practices
Requirements
B.Tech / M.Tech degree in Computer Science from a premier institute
9+ years of proven experience in backend platform engineering, distributed systems, or infrastructure software
Strong computer science fundamentals — particularly in distributed systems, concurrency, and system design
Experience building or operating job scheduling, workflow orchestration, or compute management systems (e.g., Argo, Airflow, Ray, or Slurm)
Proficiency in Python and/or Java, with strong async programming skills
Experience designing and operating services backed by relational databases (PostgreSQL preferred) at scale
Deep understanding of cloud platforms, with a preference for AWS; familiarity with Azure or GCP is a plus
Proven track record of working directly with internal engineering customers (ML engineers, researchers) to shape the platform roadmap
Strong problem-solving skills with the ability to own ambiguous, complex systems independently
Experience with Kubernetes at the workload/scheduling layer (not just operations)