Flock Safety is the leading safety technology platform, helping communities thrive by taking a proactive approach to crime prevention and security. They are seeking an experienced Site Reliability Engineer who will be responsible for designing and building systems, tooling, and processes to provide a scalable and observable platform while empowering development teams to manage their application stack.
Responsibilities:
- Designing and building systems, tooling, and processes to provide an extensible, scalable, and observable platform
- Empowering development teams to own and manage their full application stack, minimizing bottlenecks and optimizing development velocity without compromising on reliability
- Ensuring the system is running and in line with internal SLIs and SLOs
- Refining CI/CD processes to ensure new code is pushed to production in a reliable and efficient manner
- Collaborating on creating a robust monitoring platform for services and their underlying infrastructure, aiming to alert on symptoms rather than waiting for outages
- Improving coverage of observability within the Aerodome platform through review and improvement of metrics/logs/traces/profiling, creating or updating dashboards and documents as needed
- Contributing to project development in a supportive role, advising software engineering teams on best practices that inform their engineering decisions
- Improving the CI pipeline, removing inefficiencies and speeding up developer feedback loops
- Creating new Helm charts or improving existing ones to automate and structure deployments
- Improving automation of the Aviation infrastructure
- Contributing to Platform Engineering efforts, building additional self-service capabilities to increase developer efficiency
- Assisting in building/integrating IoT management tooling (for drones, docks, radars)
- Ensuring self-healing and auto-scaling strategies are implemented with high availability and cost-efficient resource usage
- Standardizing and/or refactoring Terraform modules
- Participating in security reviews, remediation, and new implementations for a best-in-class security posture and compliance
Requirements:
- Experience in an SRE role with an understanding of monitoring, troubleshooting, and disaster recovery
- Extensive experience in writing production-quality code
- Proficiency with infrastructure as code and/or configuration management (we use Terraform)
- Experience managing monitoring dashboards and creating actionable alerts with tools like Grafana and Prometheus
- Ability to ensure the system is running and in line with internal SLIs and SLOs
- Experience refining CI/CD processes to ensure new code is pushed to production in a reliable and efficient manner
- Ability to collaborate on creating a robust monitoring platform for our services and their underlying infrastructure, aiming to alert on symptoms rather than waiting for outages
- Familiarity with best practices when creating and managing AWS resources (e.g. security groups, VPCs)
- Ability to obtain and maintain Criminal Justice Information Services (CJIS) certification as a condition of employment
- Applicants must meet all FBI CJIS Security Policy requirements, including a fingerprint-based background check