Implement SLI/SLO frameworks with error budgets, driving data-informed reliability decisions across the platform
Design release strategies including blue/green deployments, canary releases, automatic rollback, and version tracking
Lead incident response, author post-mortems, and build automated runbooks that reduce MTTR
Develop internal tooling, automation frameworks, and self-service platforms in TypeScript/Python to improve developer productivity and operational efficiency
Write reliability-focused services: health checkers, auto-remediation controllers, capacity managers, deployment orchestrators, and chaos testing frameworks
Build and maintain production AWS infrastructure using IaC (Terraform/CloudFormation), with focus on ECS, EKS/Kubernetes, and microservices orchestration
Build and maintain end-to-end CI/CD pipelines for backend services, mobile apps (iOS/Android), and IoT firmware across on-prem and AWS cloud environments
Define and enforce security policies: network segmentation, IAM, secrets management, encryption, compliance auditing, vulnerability management, and incident response
Build comprehensive observability with OpenTelemetry, distributed tracing, custom metrics exporters, and alerting across WebSocket connections, message delivery pipelines, and real-time communication services
Manage PostgreSQL (RDS), Redis/ElastiCache, SQS, S3, and NLB/ALB configurations including Elastic IPs for SIP/RTP traffic
Requirements
7+ years in SRE/DevOps/Platform Engineering with a strong software development background
Proficiency in at least one backend language (TypeScript/Node.js, Python, or Go) for building internal tools, CLIs, operators, and automation services