Platform Engineer

United States of America

Full Time

2 hours ago

H1B Sponsor Likely

Key skills

Google Kubernetes Engine, Kubernetes node pool design, Kubernetes autoscaling, Kubernetes network policies, Helm, Workload identity, Google Cloud Platform, Cloud SQL, VPC, Hybrid cloud connectivity, Cloud Logging, Cloud Trace, Managed Prometheus, IAM, Terraform, CI/CD pipelines, High-availability architectures, Multi-zone failover, Disaster recovery, RTO/RPO planning, Temporal.io, PostgreSQL, Cloud VPN, Dedicated Interconnect, Partner Interconnect, Elasticsearch, DevSecOps, Site Reliability Engineering, AIOps, Chaos engineering, Resilience testing, HIPAA compliance, SQL, GCP, Google Cloud, Kubernetes, GKE, Prometheus, Grafana, OpenTelemetry, Performance Optimization, CI/CD, Mentoring

About this role

Career Soft Solutions Inc is seeking a Staff Platform Engineer specializing in DevOps and SRE for its cloud infrastructure. The role owns the operational reliability and deployment automation of a Temporal-based claims processing platform on Google Kubernetes Engine, and includes mentoring other engineers and leading architectural decisions.

Responsibilities:

Own the GKE cluster architecture: regional private cluster, autoscaling node pools, network policies, Pod Disruption Budgets, and ingress configuration
Design hybrid networking from GCP to on-prem systems, including Cloud VPN/Interconnect strategy, VPC peering for Cloud SQL, and DNS resolution patterns
Lead architectural decisions for resiliency, cost efficiency, and capacity, including node sizing, autoscaling on custom metrics, and committed-use discount strategy
Champion Infrastructure as Code (Terraform) and CI/CD pipelines for containerized workloads, including image scanning, signing, and progressive rollout
Own secrets management and runtime resolution of auth profiles for downstream systems, integrating with the CVS-approved secrets backend
Operate the observability stack end-to-end: Managed Prometheus for metrics, Grafana dashboards, OpenTelemetry tracing to Cloud Trace, and structured logging to Cloud Logging
Define and operate the SRE practice: SLIs/SLOs, error budgets, on-call rotations, incident response runbooks, post-mortems, and resilience testing
Partner with Security on HIPAA-aligned controls: private cluster configuration, internal-only load balancers, IAP for internal applications, and audit logging
Mentor senior and mid-level engineers on cloud-native operations and SRE discipline; lead design and code reviews for infrastructure changes; influence engineering direction across teams

Requirements:

Multiple years of experience in DevOps, Site Reliability Engineering, or cloud platform engineering for production systems
Deep production expertise with Kubernetes (GKE strongly preferred): node pool design, autoscaling, network policies, Helm, and workload identity
Strong production experience on Google Cloud Platform, including GKE, Cloud SQL, VPC and hybrid connectivity, Cloud Logging, Cloud Trace, Managed Prometheus, and IAM
Hands-on expertise with Infrastructure as Code (Terraform required; Helm required) and CI/CD pipelines for containerized workloads
Strong understanding of high-availability architectures, multi-zone failover, disaster recovery, and RTO/RPO planning
Proven experience operating large-scale, mission-critical production environments under regulatory or compliance constraints
Advanced troubleshooting and performance optimization across Kubernetes, Linux, networking, and database layers
Experience leveraging code generation tools like Copilot to write robust test cases and rapidly prototype features
Experience collaborating across architecture, security, networking, and application teams
Experience operating Temporal.io, Cadence, or comparable distributed workflow systems in production
Hands-on experience operating PostgreSQL at scale (Cloud SQL HA, tuning, backup/PITR, schema migrations)
Experience with hybrid cloud connectivity to on-prem enterprise systems via Cloud VPN or Dedicated/Partner Interconnect
Familiarity with Elasticsearch operations (self-managed or Elastic Cloud) for visibility/search workloads
Familiarity with DevSecOps, SRE, and AIOps practices, including chaos engineering and resilience testing
Healthcare, regulated industry, or large enterprise experience; familiarity with HIPAA/PHI controls and audit retention requirements