Kraken is a technology company focused on creating a smart, sustainable energy system. As a Lead Site Reliability Engineer, you'll ensure the availability, performance, and scalability of products on the platform while leading a technical team to support millions of customers.
Responsibilities:
- Team leadership
- Own the Product Reliability team within Platform, working closely with the Director and Heads of Platform Engineering to define strategic objectives and team direction
- Manage team priorities and ensure initiatives are completed within deadlines
- Collaborate regularly and effectively with the Staff Platform Engineer in your functional team to deliver the technical implementation of the team’s strategic priorities
- Lead delivery of major initiatives on clear timelines
- Partner effectively in the wider Platform Engineering team to deliver outcomes
- Build a strong culture of open communication where teammates can ask questions without fear, promoting a positive and inclusive team environment
- People management
- Line-manage the engineers in the Product Reliability team
- Set clear performance expectations and goals for team members
- Regularly review individual and team performance, offering actionable insights and constructive feedback to support and grow team members
- Technical delivery
- Deliver technical improvements such as small features and bug fixes
- Support team delivery through code reviews, technology research and architectural guidance
- Provide support for service offerings owned by your team
- Help solve interesting and difficult problems. There’s a great opportunity for disruption in the global energy market
Requirements:
- Excellent communication skills, working effectively with developers, product managers and other business stakeholders to understand and deliver impactful projects and reliability improvements
- Record of successfully and consistently delivering critical-path projects on time and at scale
- Meticulous organisation and planning skills
- Experience of mentoring and coaching a team to perform at a high level of quality
- Experience managing and supporting large-scale, internet-facing distributed systems for millions of customers
- Solid experience with AWS and at least one programming language. We use a wide range of AWS services, not just the standard few
- Knowledge of security best practices, security and CI/CD tooling, and methodologies
- Previous experience in leading technical delivery for small, highly autonomous teams
- Previous experience as a technical individual contributor, preferably as a Site Reliability Engineer
- Track record of effective collaboration with other teams and departments to drive holistic outcomes
- A proactive, innovative mindset with the ability to drive continuous improvement
- Previous experience working in a remote-first, asynchronous, global team
- Familiarity with some of our tech stack:
- PostgreSQL, or a similar RDBMS, particularly in Amazon RDS at scale
- Docker and Kubernetes (we use Amazon EKS in production)
- Python
- Datadog, or a similar logging/monitoring tool
- Message queues, event-driven asynchronous processing, or similar technologies (we use RabbitMQ)
- Terraform, or a similar infrastructure-as-code tool
- Experience with a Linux distribution