Dataminr is a mission-driven company that provides AI-powered intelligence solutions. As a Site Reliability Engineer, you will ensure high-quality software delivery by building and maintaining tools for software engineers and data scientists, while also championing best practices within the engineering organization.
Responsibilities:
- Work on our self-service internal developer platform, used by engineering teams to deploy containers, serverless functions and cloud resources
- Maintain and improve our observability stack
- Drive improvements in security, reliability, cost efficiency and performance
- Troubleshoot large-scale distributed systems
- Work closely with product engineering teams to enable efficient project delivery
- Support our production environment as part of an on-call rota, helping with triage and resolution when issues arise
Requirements:
- Experience managing Kubernetes clusters at scale (CKA a bonus)
- Experience maintaining and hardening AWS infrastructure using Terraform
- Development skills in Python or Go
- Linux systems administration and TCP/IP networking
- Experience maintaining observability tooling, e.g. the LGTM stack or OpenSearch