Monitor and maintain the health, performance, and reliability of our hyperscale cell infrastructure processing trillions of events daily
Lead incident response and problem management through established on-call rotations and structured feedback loops
Implement comprehensive monitoring with Service Level Indicators to enable proactive alerting and automated self-healing
Conduct capacity planning and forecasting based on ingest rates and query patterns to optimize resource utilization
Ensure data integrity and compliance across >100 PB of stored data through automated consistency checks and recovery testing
Manage access controls, certificate rotation, and vulnerability management across cell infrastructure according to defined SLAs
Provision and scale cell infrastructure (vertical/horizontal) based on demand and performance requirements
Develop microservices and automation tools for cell components, including ingest writers and management systems
Orchestrate version upgrades, patch management, and configuration changes with minimal customer impact
Perform load testing and performance benchmarking to validate scaling thresholds and optimize costs
Coordinate with fleet operations, product teams, and infrastructure teams on global changes and capacity planning
Create technical documentation and operational playbooks, and partner with teams to address customer-impacting issues
Work in a team of friendly, trustworthy, and knowledgeable colleagues
Build and maintain CI/CD pipelines for testing and releasing configuration and software
Troubleshoot complex issues across multiple large-scale distributed systems, including LogScale, Kafka, object storage systems, and related infrastructure
Work closely with Engineering and Customer Support to troubleshoot time-sensitive production issues, regardless of when they happen
Apply SRE best practices, including SLOs, error budgets, chaos engineering, and blameless post-mortems
Use AI coding assistants (e.g., Anthropic Claude) effectively to accelerate development and problem-solving
Requirements
Proven experience designing and implementing distributed systems with high scalability, availability, and performance optimization at enterprise scale
Experience contributing to broad technical leadership across products or services
A can-do attitude; you thrive collaborating in a team and are not afraid of taking on responsibilities
Several years' experience with large-scale, business-critical Linux-based environments
Solid grounding in the technology of at least one cloud environment (AWS, Azure, GCP)
Experience working with CI/CD tooling such as Jenkins, Git, Artifactory, and Bitbucket
Go (golang) programming experience in production environments
Some familiarity with Python programming
Experience with configuration management systems such as Chef or Ansible
Availability for on-call on a rotational basis
Bonus Points: Experience with Kafka
Bachelor's degree in an applicable field, such as Computer Science or Engineering
Tech Stack
Ansible
AWS
Azure
Chef
Cloud
Distributed Systems
Google Cloud Platform
Jenkins
Kafka
Linux
Microservices
Python
Go
Benefits
Market leader in compensation and equity awards
Comprehensive physical and mental wellness programs
Competitive vacation and holidays to recharge
Paid parental and adoption leave
Professional development opportunities for all employees regardless of level or role
Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections