Twilio is a company focused on revolutionizing communications and fostering a remote-first work culture. They are seeking a Reliability Architect to drive the technical strategy and ensure the reliability of Twilio products worldwide by defining standards and guiding engineering teams on best practices.
Responsibilities:
- Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes
- Influence company-wide architectural decisions while balancing long-term vision with near-term and compliance needs
- Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services
- Steer architecture toward availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability tooling
- Ensure integrity and quality across the service lifecycle by designing fault-tolerant architectures and processes for incident response, disaster recovery, and capacity/cost management
- Collaborate with product and cross-functional teams to identify reliability risks and convert them into actionable designs, programs, and tooling
- Establish and champion reliability practices and drive systemic improvements
- Mentor and grow engineers and technical leaders
- Track and apply emerging SRE, cloud, and large-scale systems best practices; introduce pragmatic innovations that improve reliability at scale
Requirements:
- 15+ years of experience in Reliability Engineering, Software Engineering, or DevOps roles focused on infrastructure, backend systems, and reliability, including time as a principal engineer or architect
- Strong experience in driving strategic technical decisions and defining long-term technical vision
- In-depth understanding of the role of Reliability Engineering in a large and diverse SaaS organization
- Experience driving cross-org technical architecture outcomes
- Knowledge of cloud architecture, DevOps practices, and large-scale systems design with microservices
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments
- Hands-on experience with Kubernetes (e.g., EKS), including deploying and managing stateful services, and with cloud platforms such as AWS
- Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation for automating infrastructure
- Expertise in observability tools (e.g., Prometheus, Grafana, Datadog) for monitoring distributed systems and setting up alerting
- Proficiency in at least one programming language (e.g., Go, Python, Java) for building automation and tooling
- Experience designing incident response processes, SLOs/SLIs, and runbooks, and participating in on-call rotations
- Experience running cross-functional post-incident reviews and driving improvements
- Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs
- Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams
- Excellent problem-solving, analytical, verbal, and written communication skills, with the ability to work in cross-functional and distributed environments
- Demonstrated leadership in mentoring teams, influencing decisions, and balancing long-term objectives with short-term needs
- Ability to influence and build effective working relationships with all levels of the organization
- Specific experience owning and operating large AWS footprints
- Knowledge of Kubernetes architecture and concepts
- Experience with streaming data technologies such as Apache Kafka or Amazon MSK
- Passion for building reliable products, with prior projects in high-availability systems