Job Title: Staff Platform Engineer, DevOps / SRE (GKE & Cloud Infrastructure)
Location: Remote
Role Overview
Senior individual contributor responsible for the cloud infrastructure, deployment automation, observability, and operational reliability of the Temporal-based claims processing platform on Google Kubernetes Engine. Owns the cluster topology, Terraform-managed infrastructure, hybrid networking to on-prem systems, secrets management, and the metrics/logs/traces observability stack. Serves as a technical authority for cloud platform reliability, capacity planning, and incident response.
This role supports a strategic platform initiative within Medicare Claims Engineering to migrate the existing Automation Anywhere RPA portfolio onto a modern, code-and-config-driven workflow platform built on Temporal.io, Python/Playwright, and Google Kubernetes Engine (GKE). Workflows are visually authored on a custom React Flow canvas that emits versioned configs executed by Temporal workers. The platform operates under HIPAA governance.
Key Responsibilities
- Own the GKE cluster architecture: regional private cluster, autoscaling node pools, network policies, Pod Disruption Budgets, and ingress configuration.
- Design hybrid networking from Google Cloud Platform to on-prem systems, including Cloud VPN/Interconnect strategy, VPC peering for Cloud SQL, and DNS resolution patterns.
- Lead architectural decisions for resiliency, cost efficiency, and capacity, including node sizing, autoscaling on custom metrics, and committed-use discount strategy.
- Champion Infrastructure as Code (Terraform) and CI/CD pipelines for containerized workloads, including image scanning, signing, and progressive rollout.
- Own secrets management and runtime resolution of auth profiles for downstream systems, integrating with the CVS-approved secrets backend.
- Operate the observability stack end-to-end: Managed Prometheus for metrics, Grafana dashboards, OpenTelemetry tracing to Cloud Trace, and structured logging to Cloud Logging.
- Define and operate the SRE practice: SLIs/SLOs, error budgets, on-call rotations, incident response runbooks, post-mortems, and resilience testing.
- Partner with Security on HIPAA-aligned controls: private cluster configuration, internal-only load balancers, IAP for internal applications, and audit logging.
- Mentor senior and mid-level engineers on cloud-native operations and SRE discipline; lead design and code reviews for infrastructure changes; influence engineering direction across teams.
Required Qualifications
- Multiple years of experience in DevOps, Site Reliability Engineering, or cloud platform engineering for production systems.
- Deep production expertise with Kubernetes (GKE strongly preferred): node pool design, autoscaling, network policies, Helm, and workload identity.
- Strong production experience on Google Cloud Platform, including GKE, Cloud SQL, VPC and hybrid connectivity, Cloud Logging, Cloud Trace, Managed Prometheus, and IAM.
- Hands-on expertise with Infrastructure as Code and Kubernetes packaging (Terraform and Helm both required) and CI/CD pipelines for containerized workloads.
- Strong understanding of high-availability architectures, multi-zone failover, disaster recovery, and RTO/RPO planning.
- Proven experience operating large-scale, mission-critical production environments under regulatory or compliance constraints.
- Advanced troubleshooting and performance optimization across Kubernetes, Linux, networking, and database layers.
- Experience using AI code generation tools such as GitHub Copilot to write robust test cases and rapidly prototype features.
- Experience collaborating across architecture, security, networking, and application teams.
Preferred Qualifications
- Experience operating Temporal.io, Cadence, or comparable distributed workflow systems in production.
- Hands-on experience operating PostgreSQL at scale (Cloud SQL HA, tuning, backup/PITR, schema migrations).
- Experience with hybrid cloud connectivity to on-prem enterprise systems via Cloud VPN or Dedicated/Partner Interconnect.
- Familiarity with Elasticsearch operations (self-managed or Elastic Cloud) for visibility/search workloads.
- Familiarity with DevSecOps, SRE, and AIOps practices, including chaos engineering and resilience testing.
- Healthcare, regulated industry, or large enterprise experience; familiarity with HIPAA/PHI controls and audit retention requirements.