Operate critical infrastructure across Azure (our primary cloud) and AWS (where Conductor, our agentic healthcare automation platform, runs), including Kubernetes, managed databases, and networking
Own the observability stack and drive operational excellence — metrics, logs, traces, alerting, and the practices that turn signals into action
Build and evolve our CI/CD orchestration to make releases fast and safe
Design, build, and maintain the internal tooling, shared libraries, and automation that the Engineering team relies on every day
Serve as the Engineering team's first line of support for infrastructure, CI/CD, and access issues: triaging, resolving, and turning recurring asks into self-serve tooling
Drive cloud cost engineering — vendor commitment decisions, cluster rightsizing, lifecycle policies, and the unsexy-but-meaningful work of making our infrastructure spend match our actual usage
Partner with product engineering teams throughout the SDLC to ensure infrastructure dependencies are secure, interoperate cleanly, and scale
Improve our information security and regulatory compliance posture (SOC 2, HIPAA, PHIPA) through thoughtful platform design
Raise the bar on developer experience by building high-quality shared code, documenting clearly, and actively communicating best practices
Requirements
Bachelor's degree in Software Engineering, Computer Science, or equivalent practical experience
4+ years developing and operating cloud-native services on a major cloud (Azure, AWS, or GCP)
A collaborative mindset, strong written and verbal communication, and a track record of partnering well across teams while maintaining a strong sense of ownership
Comfort delivering solutions to ambiguous, open-ended problems
Experience debugging production issues across the stack — correlating traces, logs, Kubernetes events, and infrastructure state to isolate root cause
Hands-on expertise configuring and operating Kubernetes
An interest in working with AI-assisted tools across the full lifecycle (development, operations, and incident response)
You have at least 3 of the following:
Experience developing services in Go, especially cloud-native REST APIs
Infrastructure as Code with Terraform (or equivalent)
Production experience with observability tooling (OpenTelemetry, Prometheus, Grafana, Datadog, or similar)
Building and maintaining CI/CD systems (GitHub Actions, Argo CD, etc.)
Administering MySQL or similar relational databases, including query optimization
Linux system administration — log spelunking, shell scripting, systemd, cron
Working knowledge of Azure or AWS best practices
Information security fundamentals (OWASP Top 10, IAM, threat modeling)