Dataminr is a mission-driven company that focuses on providing real-time intelligence through its AI-powered platform. As a Site Reliability Engineer, you will build and maintain tools for software engineers and data scientists, ensuring high-quality software delivery and championing best practices within the engineering organization.
Responsibilities:
- Work on our self-service internal developer platform, which engineering teams use to deploy containers, serverless functions and cloud resources
- Maintain and improve our observability stack
- Drive improvements in security, reliability, cost efficiency and performance
- Troubleshoot large-scale distributed systems
- Work closely with product engineering teams to enable efficient project delivery
- Support our production environment as part of an on-call rota, helping with triage and resolution when issues arise
Requirements:
- Experience managing Kubernetes clusters at scale (CKA a bonus)
- Experience maintaining and hardening AWS infrastructure using Terraform
- Development skills in Python or Go
- Linux systems administration and TCP/IP networking
- Experience maintaining observability tooling, e.g. the LGTM stack (Loki, Grafana, Tempo, Mimir) or OpenSearch