Booz Allen Hamilton is seeking a Senior Site Reliability Engineer to enhance system resilience and efficiency for civil and defense agencies. The role involves building robust infrastructures, automating processes, and mentoring junior engineers to improve application reliability and performance.
Responsibilities:
- Work with civil and defense agencies to develop more robust systems by building resilient infrastructure
- Build in redundancy, implement monitoring tools, and automate wherever possible
- Reduce toil by scripting routine tasks and automating self-repair
- Apply expertise in automating application resiliency and in measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening the knowledge base
- Coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
Requirements:
- 5+ years of experience measuring service SLIs using custom metrics, logs, and traces with tools such as Prometheus, Grafana, or OpenTelemetry
- 5+ years of experience developing Infrastructure as Code (IaC) in Terraform
- 5+ years of experience automating operational tasks and identifying and reducing toil
- 5+ years of experience scripting or coding in Python, Go, or Bash
- 5+ years of experience designing SLIs, SLOs, and error budgets
- 5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
- Experience working in a cloud platform such as AWS, Azure, or GCP
- Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
- Ability to obtain a Secret clearance
- HS diploma or GED
- Experience implementing AI Ops
- Experience integrating with ServiceNow
- Experience implementing self-healing solutions
- Experience with application programming interfaces (APIs) and applying advanced SRE practices
- Knowledge of chaos engineering and resilience testing
- Ability to work in an Agile environment and produce operational runbooks and playbooks
- Ability to pay strict attention to detail
- Cloud Certification