Chainlink Labs is the industry-standard oracle platform powering decentralized finance (DeFi). The Senior Site Reliability Engineer will design and build infrastructure primitives to ensure reliability and scalability of Chainlink's decentralized oracle networks, focusing on Kubernetes-based control planes and automation.
Responsibilities:
- Design and build the infrastructure primitives that define how Chainlink Decentralized Oracle Networks (DONs) scale across internal systems and the decentralized ecosystem
- Help create the Kubernetes-based CRE control plane that enables deterministic horizontal scaling of DONs, safe and repeatable infrastructure expansion, and improved operational efficiency and scalability
- Develop the core infrastructure components, including Kubernetes Operators and scaling automation, that Product teams will adopt and that may later be distributed to external node operators to improve decentralized scaling
- Build the systems that define how Chainlink scales while shaping the reliability, scalability, and decentralization of protocol-level services
Requirements:
- 6–9+ years in SRE / Platform / Infrastructure Engineering
- Proven experience scaling Kubernetes in high-throughput production environments
- Deep knowledge of scheduler behavior, StatefulSets and persistent workloads, autoscaling strategies (HPA, VPA, KEDA, custom scaling), resource management and performance tuning, and multi-cluster and multi-region architectures
- Experience diagnosing production failures at cluster scale
- Strong Terraform or Crossplane experience
- GitOps workflows (ArgoCD / Flux) experience
- CI/CD reliability experience
- Automation-first mindset
- AWS production experience
- Proficiency in Go (strongly preferred) or equivalent systems language
- Experience with web3 concepts (e.g. blockchain node lifecycle, forks, reorgs, or RPC issues)
- Experience with oracle systems, token architectures, or decentralized services
- Experience scaling stateful high-availability distributed systems
- Experience building internal platform primitives
- Experience implementing custom autoscaling logic
- Experience designing SLO strategies and error-budget usage
- Experience improving diagnosability and observability frameworks
- Experience working in high-ambiguity environments
- Experience operating blockchain infrastructure in production
- Certified Kubernetes Administrator (CKA)
- Experience contributing to Kubernetes ecosystem projects
- Experience building multi-tenant platform infrastructure
- Experience working in high-security or SOC 2 / ISO 27001-compliant environments
- Experience with chaos engineering practices or implementation