Senior ML Platform Engineer at CrowdStrike | JobVerse
CrowdStrike
Senior ML Platform Engineer
India
Full Time
2 hours ago
H1B Sponsor
Key skills
Airflow
AWS
Azure
Cloud
Distributed Systems
Docker
Google Cloud Platform
Grafana
Kubernetes
Linux
Prometheus
Python
Ray
Spark
Unix
ML
Jupyter
MLflow
Kubeflow
GCP
Google Cloud
K8s
About this role
Role Overview
Diagnose and resolve issues across Ray, Spark, Airflow, MLflow, JupyterHub, Kubeflow, and SLURM
Perform root cause analysis on production incidents affecting training and inference pipelines
Debug performance bottlenecks, resource contention, memory leaks, and scheduling conflicts
Develop debugging tools and diagnostic frameworks
Profile and optimize Ray clusters and Spark jobs on K8s and Cloud (EMR/Dataproc)
Troubleshoot JupyterHub spawner issues, kernel crashes, and resource allocation
Optimize SLURM job scheduling, GPU allocation, and HPC cluster utilization
Build observability solutions and automated health checks
Develop runbooks, alerting workflows, and incident response procedures
Maintain platform stability metrics (SLAs, error rates, latency)
Partner with ML and ML Platform engineers to resolve workflow issues
Conduct post-mortems and mentor engineers on debugging techniques
Requirements
12+ years in distributed systems engineering
5+ years debugging ML platforms in production
Deep expertise in 3+ of: Ray, Spark, JupyterHub, SLURM, K8s
Performance profiling, optimization, and capacity planning
Technical Skills (Expertise in at least one):
Distributed ML: Ray, Spark, SLURM, Jupyter Ecosystem (debugging failures, performance tuning)
ML Platforms: Airflow, MLflow, JupyterHub (troubleshooting core components)
Infrastructure: Kubernetes, Docker, AWS/GCP/Azure/OCI
Observability: Profiling tools, distributed tracing, Prometheus, Grafana, log aggregation
Programming: Expert Python debugging, multi-language proficiency, Linux/Unix
Benefits
Market leader in compensation and equity awards
Comprehensive physical and mental wellness programs
Competitive vacation and holidays to recharge
Paid parental and adoption leave
Professional development opportunities for all employees regardless of level or role
Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections
Vibrant office culture with world class amenities
Great Place to Work Certified™ across the globe