CrowdStrike is a global leader in cybersecurity, dedicated to stopping breaches and redefining modern security with its AI-native platform. The role involves working within the Embedded Reliability team to enhance the reliability of CrowdStrike's production systems, focusing on solving complex distributed systems problems and driving architectural improvements.

Responsibilities:

Partner with engineering leadership to define and drive reliability roadmaps
Design and implement architectural improvements to services, libraries, and platforms that impact teams across CrowdStrike
Establish foundational observability practices: ensure teams instrument services properly, react to signals effectively, and leverage observability to drive automation such as continuous delivery
Lead performance and cost optimization: profiling, bottleneck analysis, capacity planning, and efficiency improvements across cloud infrastructure
Define and implement service-level objectives that drive decision-making and prioritization
Conduct resilience engineering: chaos experiments, failure injection, and designing for graceful degradation
Provide technical leadership during complex incidents and drive systemic improvements
Mentor and coach engineers, building a culture of excellence and driving architectural standards across the organization

Requirements:

7-10+ years building and operating distributed systems at scale
Expert-level proficiency in at least one programming language; willingness to become proficient in Go
Deep understanding of distributed systems: e.g. consensus algorithms, replication, consistency, failure modes, scalability patterns
Proven experience scaling backend systems: e.g. sharding, partitioning, horizontal scaling, capacity planning, performance optimization
Track record of making impactful architectural decisions and seeing them through to production
Strong systems thinking and ability to influence without direct authority across organizational boundaries
Degree in Computer Science, or equivalent experience in data structures, algorithms, and distributed systems
Experience driving reliability improvements in organizations with hundreds or thousands of microservices
Deep knowledge of Kubernetes, cloud platforms, or other large-scale orchestration systems
Experience with AWS, Cassandra, Kafka, OpenSearch, or similar large-scale distributed systems
Track record of building internal platforms or tools that other engineers use
Experience in infrastructure cost optimization at scale
Background in performance engineering: profiling, optimization, understanding system bottlenecks
Experience with chaos engineering or resilience testing practices
History of establishing SLO/SLI frameworks and error budgets in production environments
Background in cybersecurity or intelligence fields
Experience building developer platforms or improving developer experience

Sr. Software Engineer, Product SRE
