CrowdStrike is a global leader in cybersecurity, dedicated to stopping breaches and redefining modern security with its AI-native platform. The role involves working within the Embedded Reliability team to enhance the reliability of CrowdStrike's production systems, focusing on solving complex distributed systems problems and driving architectural improvements.
Responsibilities:
- Partner with engineering leadership to define and drive reliability roadmaps
- Design and implement architectural improvements to services, libraries, and platforms that impact teams across CrowdStrike
- Establish foundational observability practices: ensure teams instrument services properly, react to signals effectively, and use observability to drive automation such as continuous delivery
- Lead performance and cost optimization: profiling, bottleneck analysis, capacity planning, and efficiency improvements across cloud infrastructure
- Define and implement service-level objectives that drive decision-making and prioritization
- Conduct resilience engineering: chaos experiments, failure injection, and designing for graceful degradation
- Provide technical leadership during complex incidents and drive systemic improvements
- Mentor and coach engineers, building a culture of excellence and driving architectural standards across the organization
Requirements:
- 7-10+ years building and operating distributed systems at scale
- Expert-level proficiency in at least one programming language; willingness to become proficient in Go
- Deep understanding of distributed systems, e.g., consensus algorithms, replication, consistency, failure modes, scalability patterns
- Proven experience scaling backend systems, e.g., sharding, partitioning, horizontal scaling, capacity planning, performance optimization
- Track record of making impactful architectural decisions and seeing them through to production
- Strong systems thinking and ability to influence without direct authority across organizational boundaries
- Degree in Computer Science or equivalent experience in data structures/algorithms/distributed systems
- Experience driving reliability improvements in organizations with hundreds or thousands of microservices
- Deep knowledge of Kubernetes, cloud platforms, or other large-scale orchestration systems
- Experience with AWS, Cassandra, Kafka, OpenSearch, or similar large-scale distributed systems
- Track record of building internal platforms or tools that other engineers use
- Experience in infrastructure cost optimization at scale
- Background in performance engineering: profiling, optimization, understanding system bottlenecks
- Experience with chaos engineering or resilience testing practices
- History of establishing SLO/SLI frameworks and error budgets in production environments
- Background in cybersecurity or intelligence fields
- Experience building developer platforms or improving developer experience