Own reliability outcomes for Tango’s cloud platform (availability, latency, performance, and scalability) across production and non-production environments
Design, implement, and operate SLOs/SLIs, error budgets, and reliability reporting; drive prioritization of reliability work with Engineering and Product
Build and maintain observability foundations: metrics, logging, tracing, dashboards, and actionable, low-noise alerting
Lead incident response and post-incident reviews (blameless RCAs); implement remediation and prevention work to measurably reduce repeat incidents
Engineer and evolve CI/CD and release safety practices (progressive delivery, canary/blue-green, automated rollbacks, change controls)
Improve infrastructure-as-code and environment consistency; standardize and harden platform components
Partner with Security and Compliance to support secure operations, vulnerability remediation, audits, and customer trust requirements
Optimize cloud cost and capacity through right-sizing, autoscaling, and performance tuning; track and report on cost drivers
Enable engineering teams with reliable internal tooling, runbooks, and self-service operational capabilities
Mentor engineers on reliability best practices, operational excellence, and automation
Requirements
8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering supporting distributed SaaS applications
Strong background in Linux systems engineering, networking fundamentals (TCP/IP, DNS, load balancing), and troubleshooting in production
Proficiency with at least one programming language used for automation (e.g., Python, Go, or Java) and strong scripting skills
Hands-on experience with cloud infrastructure (AWS, Azure, or GCP)
Deep experience with infrastructure-as-code and configuration management (e.g., Terraform, CloudFormation, Ansible)
Expertise in containerization and orchestration (Docker, Kubernetes) and operating cloud-native services
Strong observability practice with tools such as Prometheus/Grafana, Datadog, New Relic, OpenTelemetry, ELK/Splunk, or equivalent
Demonstrated incident management leadership, root cause analysis skills, and a continuous improvement mindset
Experience designing and operating CI/CD pipelines and release management practices (e.g., GitHub Actions, Jenkins, GitLab CI, ArgoCD)
Ability to work cross-functionally with Engineering, Product, Support, and Security; clear written and verbal communication
Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
Relevant certifications are a plus (e.g., AWS/Azure/GCP, Kubernetes CKA/CKAD, ITIL, or security-focused certifications)
Tech Stack
Ansible
AWS
Azure
Cloud
DNS
Docker
Go
Google Cloud Platform
Grafana
Java
Jenkins
Kubernetes
Linux
Prometheus
Python
Splunk
TCP/IP
Terraform
Benefits
Competitive compensation
Comprehensive benefits, including health, dental, and vision insurance