AlphaX is building distributed observability and multi-cloud intelligence software for modern AI and data-intensive systems. They are seeking a Senior DevOps / Infrastructure Engineer to design, scale, and operate the systems that power their observability platform, focusing on multi-region cloud deployments and system reliability.
Responsibilities:
- Architect and operate multi-region deployments across AWS, GCP, or Azure
- Build and maintain high-throughput telemetry ingestion pipelines
- Design autoscaling and failover strategies for mission-critical services
- Own observability systems including Prometheus, Grafana, and distributed tracing
- Improve MTTR and operational readiness processes
- Manage CI/CD pipelines, GitOps workflows, and automated deployments
- Collaborate with backend teams on API performance and infrastructure reliability
- Harden infrastructure for security, compliance, and tenant isolation
- Drive the long-term infrastructure roadmap and architectural direction
Requirements:
- Deep experience with Kubernetes, Docker, and container orchestration
- Strong background in distributed systems and multi-region architectures
- Experience with high-ingest, streaming, or event-driven systems
- Hands-on experience with Prometheus, Grafana, and tracing/alerting frameworks
- Proficiency with Terraform or similar infrastructure-as-code tools
- Experience building and maintaining CI/CD pipelines
- Strong understanding of AWS, GCP, or Azure
- Proficiency in Python or Go scripting for automation and tooling
- Experience operating high-availability, production-critical systems
- Experience with Cloudflare (DNS, CDN, WAF, SSL)
- Experience with Helm, Kustomize, or similar Kubernetes tooling
- Experience with time-series databases, vector databases, or high-throughput storage systems
- Background in SRE, platform engineering, or observability tooling
- Experience supporting AI/ML workloads or GPU-based systems
- Familiarity with OpenTelemetry, Jaeger, or similar distributed tracing frameworks