Kraken is a mission-focused company rooted in crypto values, aiming to accelerate the global adoption of crypto. As a Senior Site Reliability Engineer, you will ensure the reliability, scalability, and performance of systems that support Kraken's growth initiatives, collaborating with development teams and managing infrastructure.
Responsibilities:
- Manage and support infrastructure for Growth teams, including Nomad and the wider HashiStack, databases, and other underlying systems
- Maintain and troubleshoot GitLab CI pipelines, ensuring reliable and fast build, test, and deployment cycles
- Provide operational support across Onboarding, Acquire, and Engage teams, helping debug issues in staging and production environments
- Participate in incident response and post-incident reviews to improve system resilience
- Consult with teams on performance, monitoring, and alerting best practices
- Build tooling, automation, and dashboards to improve observability and empower development teams
- Collaborate with developers, QA, and product managers to streamline development and release cycles
- Support a fully distributed team operating across multiple time zones
Requirements:
- 5+ years of experience in a DevOps or SRE role
- Strong experience managing infrastructure with Consul, Vault, and Terraform
- Proficiency with databases (SQL and NoSQL) and experience operating them in production
- Proficiency with Git version control and CI/CD configuration
- Deep understanding of monitoring and alerting systems, preferably Prometheus and Grafana
- Ability to debug complex issues involving distributed systems, networks, and Linux operating systems
- Experience with containerization and orchestration (Docker, Nomad; Kubernetes is a plus)
- Strong scripting skills (e.g., Bash, Python, or Go)
- Self-starter with the ability to thrive while working independently and remotely in a fast-paced environment
- Ability to collaborate effectively with multiple teams and switch context across projects
- Interest in security and consideration of the security implications of development and operational decisions
- Experience with benchmarking, performance tuning, and identifying system bottlenecks
- Familiarity with incident management best practices and tooling
- Interest in lower-level programming languages such as Rust
- Experience integrating with APIs (GitLab, Jira, Slack)
- Background working with distributed systems and technologies (Kafka, gRPC, Redis, etc.)
- Passion for building reliable, user-facing systems that scale