Career Soft Solutions Inc is seeking a Staff Platform Engineer specializing in DevOps and SRE for its cloud infrastructure. The role oversees the operational reliability and deployment automation of a Temporal-based claims processing platform on Google Kubernetes Engine, and includes mentoring other engineers and leading architectural decisions.
Responsibilities:
- Own the GKE cluster architecture: regional private cluster, autoscaling node pools, network policies, Pod Disruption Budgets, and ingress configuration
- Design hybrid networking from GCP to on-prem systems, including Cloud VPN/Interconnect strategy, VPC peering for Cloud SQL, and DNS resolution patterns
- Lead architectural decisions for resiliency, cost efficiency, and capacity, including node sizing, autoscaling on custom metrics, and committed-use discount strategy
- Champion Infrastructure as Code (Terraform) and CI/CD pipelines for containerized workloads, including image scanning, signing, and progressive rollout
- Own secrets management and runtime resolution of auth profiles for downstream systems, integrating with the CVS-approved secrets backend
- Operate the observability stack end-to-end: Managed Prometheus for metrics, Grafana dashboards, OpenTelemetry tracing to Cloud Trace, and structured logging to Cloud Logging
- Define and operate the SRE practice: SLIs/SLOs, error budgets, on-call rotations, incident response runbooks, post-mortems, and resilience testing
- Partner with Security on HIPAA-aligned controls: private cluster configuration, internal-only load balancers, IAP for internal applications, and audit logging
- Mentor senior and mid-level engineers on cloud-native operations and SRE discipline; lead design and code reviews for infrastructure changes; influence engineering direction across teams
Requirements:
- Multiple years of experience in DevOps, Site Reliability Engineering, or cloud platform engineering for production systems
- Deep production expertise with Kubernetes (GKE strongly preferred): node pool design, autoscaling, network policies, Helm, and workload identity
- Strong production experience on Google Cloud Platform, including GKE, Cloud SQL, VPC and hybrid connectivity, Cloud Logging, Cloud Trace, Managed Prometheus, and IAM
- Hands-on expertise with Infrastructure as Code (Terraform and Helm both required) and CI/CD pipelines for containerized workloads
- Strong understanding of high-availability architectures, multi-zone failover, disaster recovery, and RTO/RPO planning
- Proven experience operating large-scale, mission-critical production environments under regulatory or compliance constraints
- Advanced troubleshooting and performance optimization across Kubernetes, Linux, networking, and database layers
- Experience using code generation tools such as Copilot to write robust test cases and rapidly prototype features
- Experience collaborating across architecture, security, networking, and application teams
- Experience operating Temporal.io, Cadence, or comparable distributed workflow systems in production
- Hands-on experience operating PostgreSQL at scale (Cloud SQL HA, tuning, backup/PITR, schema migrations)
- Experience with hybrid cloud connectivity to on-prem enterprise systems via Cloud VPN or Dedicated/Partner Interconnect
- Familiarity with Elasticsearch operations (self-managed or Elastic Cloud) for visibility/search workloads
- Familiarity with DevSecOps, SRE, and AIOps practices, including chaos engineering and resilience testing
- Healthcare, regulated industry, or large enterprise experience; familiarity with HIPAA/PHI controls and audit retention requirements