Take on ambiguous reliability, scalability, and efficiency challenges and drive solutions across SRE and development teams.
Build and run large-scale, massively distributed, fault-tolerant systems that keep the Genesis platform reliable and performant for our customers.
Optimize existing systems, build infrastructure, and eliminate toil through automation to continuously improve uptime and rate of change.
Cultivate a culture of reliability throughout the organization, guiding technical decisions that balance system health with fast-moving product priorities.
Ensure the long-term health, maintainability, and reliability of services through capacity planning, performance analysis, and proactive incident prevention.
Requirements
Strong software engineering skills (e.g., in Python, Go, or similar) with extensive experience designing, analyzing, and troubleshooting distributed systems.
Deep expertise with cloud computing platforms and technologies (e.g., Kubernetes, Cloud Functions) and with Non-Abstract Large Systems Design (NALSD).
Experience leading complex, large-scale technical projects and providing technical leadership across teams.
Ability to apply coding, algorithms, and complexity analysis to solve ambiguous problems at scale with minimal disruption.
A collaborative, intellectually curious mindset: comfortable working with people from a wide variety of backgrounds and bringing a cross-team perspective to build robust, reusable solutions.