RAZOR is looking for a strong DevOps engineer to help scale and operationalize its infrastructure as the platform grows. The role focuses on CI/CD, infrastructure automation, deployment reliability, observability, and GPU-oriented workload scaling.
Responsibilities:
- Improve CI/CD pipelines, deployment workflows, and release reliability
- Standardize infrastructure and deployment patterns across environments
- Improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
- Partner closely with backend engineering on deployment strategies, infrastructure automation, environment consistency, migration workflows, and possible Kubernetes migration efforts
- Support ML-oriented infrastructure as a secondary responsibility, including SageMaker workloads, Ray clusters, GPU scaling patterns, distributed batch execution, autoscaling behavior, runtime/image management, and artifact delivery/versioning
- Work on deployment safety and rollback strategies
- Ensure infrastructure consistency across environments
- Automate release and environment promotion flows
- Manage autoscaling and runtime stability
- Orchestrate GPU workloads and improve scaling efficiency
- Develop operational tooling that reduces friction for engineering teams
Requirements:
- Strong DevOps engineer
- Experience with CI/CD pipelines
- Experience with infrastructure automation
- Experience with deployment reliability
- Experience with observability
- Experience with GPU-oriented workload scaling
- Ability to improve CI/CD pipelines, deployment workflows, and release reliability
- Ability to standardize infrastructure and deployment patterns across environments
- Ability to improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
- Ability to partner closely with backend engineering on deployment strategies, infrastructure automation, environment consistency, migration workflows, and possible Kubernetes migration efforts
- Ability to support ML-oriented infrastructure, including SageMaker workloads, Ray clusters, GPU scaling patterns, distributed batch execution, autoscaling behavior, runtime/image management, and artifact delivery/versioning
- Ability to work on deployment safety and rollback strategies
- Ability to ensure infrastructure consistency across environments
- Ability to automate release and environment promotion flows
- Ability to manage autoscaling and runtime stability
- Ability to orchestrate GPU workloads and improve scaling efficiency
- Ability to create operational tooling that reduces friction for engineering teams
- Experience with AWS
- Experience with Terraform
- Experience with Docker
- Experience with Kubernetes
- Experience with CI/CD systems
- Experience with SageMaker
- Experience with Ray
- Experience with GPU compute infrastructure
- Experience operating production infrastructure at meaningful scale
- Strong in practical DevOps execution and operational reliability
- Cares about automation, observability, and deployment safety
- Comfortable improving developer workflows and infrastructure tooling
- Experience with distributed systems or GPU-oriented workloads