AWS, Grafana, Kafka, Kubernetes, Prometheus, Python, Spark, EKS, Helm, AWS CDK, S3, RDS, IAM, CloudWatch, CI/CD, Remote Work
About this role
Role Overview
Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.
Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards rather than basic infrastructure provisioning.
Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring they expose meaningful operational signals.
Operates and enhances Kubernetes cluster add-ons such as ingress controllers, cert-manager, autoscalers, and the monitoring, logging, and tracing stacks.
Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.
Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.
Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.
Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.
Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.
Applies Site Reliability Engineering (SRE) principles—including observability-first design, error budgets, and operational readiness—to shared platform services.
Supports IAM roles, secrets management, and tenant isolation best practices.
Requirements
Has 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.
Demonstrates strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and reliability indicators for complex systems.
Has hands-on experience with AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
Demonstrates fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
Possesses a strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction, and incident-driven monitoring improvements.
Has experience improving existing systems rather than building greenfield infrastructure, with a focus on operational excellence and system reliability.
Shows a proven track record of using observability data to drive automation, scaling decisions, and operational improvements.
Has experience designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines (nice to have).
Tech Stack
AWS
Grafana
Kafka
Kubernetes
Prometheus
Python
Spark
Benefits
100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
Work with Top American Companies: Grow your expertise by working on innovative, high-impact projects with industry-leading U.S. companies.