Cribl builds telemetry infrastructure for the AI era and partners with major enterprises to manage and analyze telemetry data. The company is seeking a Senior Software Engineer to join its Storage team, which designs and builds scalable storage infrastructure on AWS with a focus on automation and self-healing systems.

Responsibilities:

Design and build autoscaling systems for storage clusters — automated provisioning, scale-up/scale-down policies, cluster rebalancing, and node lifecycle management
Own the infrastructure-as-code stack (Terraform) that defines and deploys storage infrastructure end-to-end on AWS
Build self-healing automation: health checks, automated failover, capacity rebalancing, and remediation controllers that resolve issues before they page anyone
Develop the CI/CD pipelines and deployment tooling for storage services — safe rollouts, canary deployments, automated rollback
Design and implement observability for the entire storage platform — metrics, dashboards, SLOs, alerting, and capacity forecasting that drive automated scaling decisions
Own cluster management tooling: provisioning new tenants, managing cluster topology, coordinating upgrades and migrations with zero downtime
Drive performance and cost optimization across the storage data path: ingest pipelines, compaction, partitioning, and query execution
Partner with product engineering to define scalability limits, load test new features, and harden the system for production readiness
Contribute to incident response and lead blameless post-mortems, turning operational surprises into systemic automation
This position requires stand-by, on-call, or off-hours duties

Requirements:

Significant experience building platform/infrastructure systems that manage, scale, and operate distributed services autonomously — not just using infrastructure, but building the layer that automates it
Strong software engineering skills in TypeScript/Node.js, Go, or similar languages — you write controllers, operators, and automation, not runbooks
Deep hands-on experience with infrastructure-as-code (Terraform) and AWS services (EC2, ECS/EKS, ASGs, DynamoDB, S3, CloudWatch)
Experience designing and implementing autoscaling systems, cluster orchestration, or automated provisioning for stateful workloads
Track record operating data-intensive systems at scale — OLAP databases, NoSQL stores, or distributed storage platforms
Strong platform engineering fundamentals: SLOs, error budgets, capacity planning, incident response, and a bias toward eliminating toil through software
Comfortable working with high autonomy in a remote, distributed team and communicating effectively across engineering disciplines
Strong understanding of Linux systems, networking, and performance profiling at the infrastructure level
Experience with DynamoDB or similar NoSQL databases at high throughput — partition design, capacity management, GSI optimization
Background in cluster management for OLAP or analytical databases — automated provisioning, rolling upgrades, replication topology
Experience with object storage and data lake architectures (S3, Parquet/ORC formats)
Knowledge of data pipeline optimization: batching strategies, write amplification reduction, partition pruning, compaction policies
Background in capacity planning, cost optimization, and resource forecasting for storage-heavy workloads on AWS
Experience building internal platforms or developer tooling that other engineers consume (deployment frameworks, service provisioning, self-service infrastructure)
Opinions about what makes a great on-call experience and a track record of making on-call better for everyone

Sr Software Engineer, Storage
