Kraken is a technology company focused on creating a smart, sustainable energy system. They are seeking a Senior Platform Engineer to contribute to the availability, performance and scalability of products across Kraken, helping engineering teams improve monitoring and alerting across their product suite.
Responsibilities:
- Support and implement the monitoring and alerting strategy across Kraken’s customer business
- Define and uphold observability best practices across multiple products and platforms
- Partner with product teams to implement observability tooling and improve reliability across the organisation
- Help product teams build best-in-class dashboards for their requirements or bespoke use cases
- Work with product teams to define and implement meaningful Service Level Objectives (SLOs) and Service Level Indicators (SLIs), aligned to contractual Service Level Agreements (SLAs)
- Build, tune, and continuously improve alerts and monitors using golden signals (latency, traffic, errors, saturation) as a framework - reducing noise and increasing actionable signal
- Help product teams transition to on-call models by improving signals, alert quality, and operational readiness
- Improve tooling and self-service capabilities for alerting and monitoring across multiple product teams
- Analyse incident metrics to identify trends and improvement opportunities, communicating insights clearly back to product teams
- Manage the cost and usage of our observability tooling stack in collaboration with FinOps
- Contribute to broader platform reliability infrastructure improvements where needed
- Help solve interesting and difficult problems - there’s a significant opportunity for disruption in the global energy market
Requirements:
- Solid hands-on experience across our core platform stack:
  - AWS (supporting and improving cloud infrastructure used by product teams)
  - Terraform (infrastructure as code; comfortable operating with Terraform day-to-day)
  - Kubernetes (container orchestration and deployment management; comfortable working with Kubernetes day-to-day)
- Experience using industry-standard observability tooling - we use Datadog, Grafana, Prometheus and Rootly (experience with other monitoring/alerting platforms is transferable)
- Strong collaboration and communication skills - able to work effectively with developers, product managers, and other stakeholders to design and deliver impactful observability 'golden paths' and monitoring experiences
- Exposure to Python (or a similar language such as TypeScript, Go, or C#) - able to understand how applications behave in production to support observability and reliability improvements
- Previous experience working in small, highly autonomous teams
- A working style that fits how we operate:
  - Comfortable with ambiguity and able to create structure in unclear situations
  - Proactive learning mindset (experiment, iterate, and adapt as the team evolves approaches)
  - Strong asynchronous written communication (Slack/Notion/docs) and a habit of keeping others in the loop
  - Autonomy and accountability - making progress independently and owning outcomes
- Previous experience working in a data-focused or Observability team
- Experience working on SaaS platforms, including engaging product teams to ensure upskilling and knowledge sharing
- Experience building observability tooling to support large-scale internet-facing services
- Experience instrumenting and diagnosing issues with very large relational databases
- Familiarity with PostgreSQL (or similar RDBMS), particularly Amazon RDS at scale
- Experience using SLOs to drive meaningful performance and reliability improvements