Flock Safety is the leading safety technology platform, helping communities thrive by taking a proactive approach to crime prevention and security. They are seeking an experienced Site Reliability Engineer who will be responsible for designing and building systems, tooling, and processes to provide a scalable and observable platform while empowering development teams to manage their application stack.

Responsibilities:

Designing and building systems, tooling, and processes to provide an extensible, scalable, and observable platform
Empowering development teams to own and manage their full application stack, minimizing bottlenecks and optimizing development velocity without compromising on reliability
Ensuring the system is running and in line with internal SLIs and SLOs
Refining CI/CD processes to ensure new code is pushed to production in a reliable and efficient manner
Collaborating on creating a robust monitoring platform for services and their underlying infrastructure, aiming to alert on symptoms and not outages
Improving coverage of observability within the Aerodome platform through review and improvement of metrics/logs/traces/profiling, creating or updating dashboards and documents as needed
Contributing to project development in a supportive role, advising software engineering teams on best practices that drive software engineering decisions
Improving the CI pipeline, removing inefficiencies and speeding up developer feedback loops
Creating or improving existing Helm charts to automate and structure
Improving automation of the Aviation infrastructure
Contributing to the Platform Engineering efforts, building additional self-service ability to increase developer efficiency
Assisting in building/integrating IoT management tooling (for drones, docks, radars)
Ensuring self-healing and auto-scaling strategies are implemented with high availability and cost-efficient resource usage
Standardizing and/or refactoring Terraform modules
Participating in security reviews, remediation, and new implementations for a best-in-class security posture and compliance

Requirements:

Experience in an SRE role with an understanding of monitoring, troubleshooting, and disaster recovery
Extensive experience in writing production-quality code
Proficiency with infrastructure as code and/or configuration management (we use Terraform)
Experience managing monitoring dashboards and creating actionable alerts with tools like Grafana and Prometheus
Ability to ensure the system is running and in line with internal SLIs and SLOs
Experience refining CI/CD processes to ensure new code is pushed to production in a reliable and efficient manner
Ability to collaborate on creating a robust monitoring platform for our services and their underlying infrastructure, aiming to alert on symptoms and not outages
Familiarity with best practices when creating and managing AWS resources (e.g. security groups, VPCs)
Ability to obtain and maintain Criminal Justice Information Services (CJIS) certification as a condition of employment
Applicants must meet all FBI CJIS Security Policy requirements, including a fingerprint-based background check

Site Reliability Engineer III, Aviation
