Implement SLI/SLO frameworks with error budgets, driving data-informed reliability decisions across the platform
Design release strategies including blue/green deployments, canary releases, automatic rollback, and version tracking
Lead incident response, author post-mortems, and build automated runbooks that reduce MTTR
Develop internal tooling, automation frameworks, and self-service platforms in TypeScript/Python to improve developer productivity and operational efficiency
Write reliability-focused services: health checkers, auto-remediation controllers, capacity managers, deployment orchestrators, and chaos testing frameworks
Build and maintain production AWS infrastructure using IaC (Terraform/CloudFormation), with focus on ECS, EKS/Kubernetes, and microservices orchestration
Build and maintain end-to-end CI/CD pipelines for backend services, mobile apps (iOS/Android), and IoT firmware across on-prem and AWS cloud environments
Define and enforce security policies: network segmentation, IAM, secrets management, encryption, compliance auditing, vulnerability management, and incident response
Build comprehensive observability with OpenTelemetry, distributed tracing, custom metrics exporters, and alerting across WebSocket connections, message delivery pipelines, and real-time communication services
Manage PostgreSQL (RDS), Redis/ElastiCache, SQS, S3, and NLB/ALB configurations, including Elastic IPs for SIP/RTP traffic

7+ years in SRE/DevOps/Platform Engineering with a strong software development background
Proficiency in at least one backend language (TypeScript/Node.js, Python, or Go) for building internal tools, CLIs, operators, and automation services
Deep AWS expertise: ECS, EKS, RDS, ElastiCache, SQS, VPC networking, IAM, CloudWatch
Strong IaC proficiency (Terraform, CloudFormation, or Pulumi) including module design, state management, and drift detection
Proven CI/CD pipeline design experience in both on-prem and cloud environments (GitHub Actions, CodeBuild/CodePipeline, self-hosted runners)
Container orchestration at scale: Docker, ECS task definitions, Kubernetes, Helm, with experience writing custom controllers or operators
Solid security background: network security, secrets management, compliance, incident response
Experience implementing SLI/SLO frameworks, error budgets, and toil reduction strategies
Production PostgreSQL, Redis, and message queue operations (SQS, Redis Streams)
Strong understanding of distributed systems patterns: circuit breakers, retries, backpressure, graceful degradation

A role where engineering and operations merge: you'll ship code that keeps the platform running
Technically challenging environment spanning cloud, IoT, telecom, and satellite systems
Full ownership of the infrastructure stack with direct impact on reliability and scale
Competitive compensation, flexible remote work, and a great work environment

Senior SRE/DevOps Engineer

Key skills