Cribl builds telemetry infrastructure for the AI era and partners with major enterprises to manage and analyze their telemetry data. The company is seeking a Senior Software Engineer for its Storage team to design and build scalable storage infrastructure on AWS, with a focus on automation and self-healing systems.
Responsibilities:
- Design and build autoscaling systems for storage clusters — automated provisioning, scale-up/scale-down policies, cluster rebalancing, and node lifecycle management
- Own the infrastructure-as-code stack (Terraform) that defines and deploys storage infrastructure end-to-end on AWS
- Build self-healing automation: health checks, automated failover, capacity rebalancing, and remediation controllers that resolve issues before they page anyone
- Develop the CI/CD pipelines and deployment tooling for storage services — safe rollouts, canary deployments, automated rollback
- Design and implement observability for the entire storage platform — metrics, dashboards, SLOs, alerting, and capacity forecasting that drive automated scaling decisions
- Own cluster management tooling: provisioning new tenants, managing cluster topology, coordinating upgrades and migrations with zero downtime
- Drive performance and cost optimization across the storage data path: ingest pipelines, compaction, partitioning, and query execution
- Partner with product engineering to define scalability limits, load test new features, and harden the system for production readiness
- Contribute to incident response and lead blameless post-mortems, turning operational surprises into systemic automation
- This position will require stand-by, on-call, or off-hours duties
Requirements:
- Significant experience building platform/infrastructure systems that manage, scale, and operate distributed services autonomously — not just using infrastructure, but building the layer that automates it
- Strong software engineering skills in TypeScript/Node.js, Go, or similar languages — you write controllers, operators, and automation, not runbooks
- Deep hands-on experience with infrastructure-as-code (Terraform) and AWS services (EC2, ECS/EKS, ASGs, DynamoDB, S3, CloudWatch)
- Experience designing and implementing autoscaling systems, cluster orchestration, or automated provisioning for stateful workloads
- Track record of operating data-intensive systems at scale, such as OLAP databases, NoSQL stores, or distributed storage platforms
- Strong platform engineering fundamentals: SLOs, error budgets, capacity planning, incident response, and a bias toward eliminating toil through software
- Comfortable working with high autonomy in a remote, distributed team and communicating effectively across engineering disciplines
- Strong understanding of Linux systems, networking, and performance profiling at the infrastructure level
- Experience with DynamoDB or similar NoSQL databases at high throughput — partition design, capacity management, GSI optimization
- Background in cluster management for OLAP or analytical databases — automated provisioning, rolling upgrades, replication topology
- Experience with object storage and data lake architectures (S3, Parquet/ORC formats)
- Knowledge of data pipeline optimization: batching strategies, write amplification reduction, partition pruning, compaction policies
- Background in capacity planning, cost optimization, and resource forecasting for storage-heavy workloads on AWS
- Experience building internal platforms or developer tooling that other engineers consume (deployment frameworks, service provisioning, self-service infrastructure)
- Opinions about what makes a great on-call experience and a track record of making on-call better for everyone