Kraken is a mission-focused company rooted in crypto values, aiming to accelerate the global adoption of crypto. As a Senior Site Reliability Engineer specialized in Data Infrastructure, you will ensure the reliability, scalability, and efficiency of the data platform while collaborating with cross-functional teams.
Responsibilities:
- Design the data governance mechanisms that ensure our lakehouse is easy to interact with, secure and in compliance with all applicable regulations
- Implement the infrastructure we use to ingest our data, store it, catalog it with the right metadata and capture its lineage
- Provide a state-of-the-art suite of BI tools for multiple teams within the company
- Guarantee the availability, high performance, scalability and cost efficiency of our data platform
- Implement self-service data infrastructure solutions that support the needs of 10+ business units and over 100 engineers and data analysts
- Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform
- Develop and maintain automation scripts using bash/shell scripting to automate operational tasks and deployments
- Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure
- Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues
- Manage and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments
- Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC)
- Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information
- Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration
- Implement effective incident response procedures and participate in on-call rotations
- Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions
- Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement
- Support AI/ML teams with their infrastructure requests
Requirements:
- Proven experience (5+ years) working as a Site Reliability Engineer, Infrastructure Engineer, Data Infrastructure Engineer, or in a similar role, with a focus on data infrastructure and security
- Experience with maintaining real-time data processing technologies, such as Kafka and Flink clusters and Debezium instances
- Working experience managing hybrid multi-tenant cloud systems, particularly on AWS
- Experience with Infrastructure as Code tools such as Terraform, Terragrunt, and Atlantis
- Experience with containerization and orchestration tools, particularly Kubernetes, Nomad, and Docker
- Solid understanding of bash/shell scripting and proficiency in at least one programming language (preferably Python or JVM languages)
- Experience maintaining data-related technologies: Apache Airflow, Apache Spark, databases, and BI tooling
- Experience solving data access management issues in a large-scale data lake
- Familiarity with CI/CD deployment pipelines and related tools
- Strong problem-solving skills and the ability to troubleshoot complex systems
- Experience with data-related technologies (databases, data lakes, Airflow, Spark) is a plus