RAZOR is looking for a strong DevOps engineer to help scale and operationalize its infrastructure as the platform grows. The role focuses on CI/CD, infrastructure automation, deployment reliability, observability, and GPU-oriented workload scaling.

Responsibilities:

Improve CI/CD pipelines, deployment workflows, and release reliability
Standardize infrastructure and deployment patterns across environments
Improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
Partner closely with backend engineering on deployment strategies, infrastructure automation, environment consistency, migration workflows, and possible Kubernetes migration efforts
Support ML-oriented infrastructure as a secondary responsibility, including SageMaker workloads, Ray clusters, GPU scaling patterns, distributed batch execution, autoscaling behavior, runtime/image management, and artifact delivery/versioning
Work on deployment safety and rollback strategies
Ensure infrastructure consistency across environments
Automate release and environment promotion flows
Manage autoscaling and runtime stability
Orchestrate GPU workloads and improve scaling efficiency
Develop operational tooling that reduces friction for engineering teams

Requirements:

Strong DevOps engineer
Experience with CI/CD pipelines
Experience with infrastructure automation
Experience with deployment reliability
Experience with observability
Experience with GPU-oriented workload scaling
Ability to improve CI/CD pipelines, deployment workflows, and release reliability
Ability to standardize infrastructure and deployment patterns across environments
Ability to improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
Ability to partner closely with backend engineering on deployment strategies, infrastructure automation, environment consistency, migration workflows, and possible Kubernetes migration efforts
Ability to support ML-oriented infrastructure, including SageMaker workloads, Ray clusters, GPU scaling patterns, distributed batch execution, autoscaling behavior, runtime/image management, and artifact delivery/versioning
Ability to work on deployment safety and rollback strategies
Ability to ensure infrastructure consistency across environments
Ability to automate release and environment promotion flows
Ability to manage autoscaling and runtime stability
Ability to orchestrate GPU workloads and improve scaling efficiency
Ability to create operational tooling that reduces friction for engineering teams
Experience with AWS
Experience with Terraform
Experience with Docker
Experience with Kubernetes
Experience with CI/CD systems
Experience with SageMaker
Experience with Ray
Experience with GPU compute infrastructure
Experience operating production infrastructure at meaningful scale
Strong in practical DevOps execution and operational reliability
Cares about automation, observability, and deployment safety
Comfortable improving developer workflows and infrastructure tooling
Experience with distributed systems or GPU-oriented workloads

Lead DevOps/MLOps Engineer
