Flock Safety is the leading safety technology platform, helping communities thrive by taking a proactive approach to crime prevention and security. The Site Reliability Engineer III in Aviation will be responsible for designing and building systems and processes that provide a scalable and observable platform, empowering development teams to manage their full application stack efficiently.
Responsibilities:
- Designing and building systems, tooling, and processes to provide an extensible, scalable, and observable platform
- Empowering development teams to own and manage their full application stack, minimizing bottlenecks and optimizing development velocity without compromising on reliability
- Onboarding and making a first-day push to infrastructure using Terraform
- Meeting the software engineering teams and learning the Aerodome Platform codebase
- Learning Flock Safety’s AWS infrastructure, security, SRE, and engineering architecture, tooling, and policies
- Creating and deploying production releases of the Aerodome platform
- Reviewing and improving observability coverage within the Aerodome platform through metrics, logs, traces, and profiling
- Contributing to project development in a supportive role and advising software engineering teams of best practices
- Improving the CI pipeline and removing inefficiencies
- Creating or improving existing Helm charts to automate and structure
- Improving automation of the Aviation infrastructure
- Contributing to the Platform Engineering efforts, building additional self-service capabilities to increase developer efficiency
- Assisting in building/integrating IoT management tooling
- Ensuring self-healing and auto-scaling strategies are implemented with high availability and cost-efficient resource usage
- Participating in security reviews, remediation, and new implementations for a best-in-class security posture and compliance
Requirements:
- Experience in an SRE role with an understanding of monitoring, troubleshooting, and disaster recovery
- Extensive experience in writing production-quality code
- Proficiency with infrastructure as code and/or configuration management (we use Terraform)
- Experience with managing monitoring dashboards using tools like Grafana and Prometheus to create actionable alerts
- Ability to ensure the system is running and in line with internal SLIs and SLOs
- Experience refining CI/CD processes to ensure new code is pushed to production in a reliable and efficient manner
- Ability to collaborate on creating a robust monitoring platform for our services and their underlying infrastructure, aiming to alert on symptoms and not outages
- Familiarity with best practices when creating and managing AWS resources (e.g. security groups, VPCs)
- Ability to obtain and maintain Criminal Justice Information Services (CJIS) certification as a condition of employment
- Applicants must meet all FBI CJIS Security Policy requirements, including a fingerprint-based background check