Flock Safety is the leading safety technology platform, helping communities thrive by taking a proactive approach to crime prevention and security. The Site Reliability Engineer III in Aviation will be responsible for designing and building systems and processes that provide a scalable and observable platform, empowering development teams to manage their full application stack efficiently.

Responsibilities:

Designing and building systems, tooling, and processes to provide an extensible, scalable, and observable platform
Empowering development teams to own and manage their full application stack, minimizing bottlenecks and optimizing development velocity without compromising on reliability
Onboarding and making a first-day push to infrastructure using Terraform
Meeting the software engineering teams and learning the Aerodome Platform codebase
Learning Flock Safety’s AWS infrastructure, security, SRE, and engineering architecture, tooling, and policies
Creating and deploying production releases of the Aerodome platform
Reviewing and improving coverage of observability within the Aerodome platform through metrics/logs/traces/profiling
Contributing to project development in a supportive role and advising software engineering teams of best practices
Improving the CI pipeline and removing inefficiencies
Creating new Helm charts or improving existing ones to automate and structure deployments
Improving automation of the Aviation infrastructure
Contributing to Platform Engineering efforts by building additional self-service capabilities that increase developer efficiency
Assisting in building/integrating IoT management tooling
Ensuring self-healing and auto-scaling strategies are implemented with high availability and cost-efficient resource usage
Participating in security reviews, remediation, and new implementations to maintain a best-in-class security posture and compliance

Requirements:

Experience in an SRE role with an understanding of monitoring, troubleshooting, and disaster recovery
Extensive experience in writing production-quality code
Proficiency with infrastructure as code and/or configuration management (we use Terraform)
Experience managing monitoring dashboards and creating actionable alerts with tools like Grafana and Prometheus
Ability to ensure the system is running and in line with internal SLIs and SLOs
Experience refining CI/CD processes to ensure new code is pushed to production in a reliable and efficient manner
Ability to collaborate on creating a robust monitoring platform for our services and their underlying infrastructure, aiming to alert on symptoms and not outages
Familiarity with best practices when creating and managing AWS resources (e.g. security groups, VPCs)
Ability to obtain and maintain Criminal Justice Information Services (CJIS) certification as a condition of employment
Applicants must meet all FBI CJIS Security Policy requirements, including a fingerprint-based background check
