Booz Allen Hamilton is seeking a Senior Site Reliability Engineer to enhance system resilience and efficiency. The role involves developing robust systems for civil and defense agencies, implementing monitoring tools, and automating operational tasks.
Responsibilities:
- Work with civil and defense agencies to develop more robust systems by building resilient infrastructure
- Build in redundancy, implement monitoring tools, and automate wherever possible
- Reduce toil by scripting routine tasks and automating self-repair
- Apply your expertise in automating application resiliency and measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base
Requirements:
- 5+ years of experience measuring service SLIs using custom metrics, logs, and traces with tools such as Prometheus, Grafana, or OpenTelemetry
- 5+ years of experience developing Infrastructure as Code (IaC) in Terraform
- 5+ years of experience automating operational tasks and identifying and reducing toil
- 5+ years of experience scripting or coding in Python, Go, or Bash
- 5+ years of experience designing SLIs, SLOs, and error budgets
- 5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
- Experience working in cloud platforms such as AWS, Azure, or GCP
- Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
- Ability to obtain a Secret clearance
- HS diploma or GED
- Experience implementing AIOps
- Experience integrating with ServiceNow
- Experience implementing self-healing solutions
- Experience with application programming interfaces (APIs) and applying advanced SRE practices
- Knowledge of chaos engineering and resilience testing
- Ability to work in an Agile environment and produce operational runbooks and playbooks
- Ability to pay strict attention to detail
- Cloud certification