Booz Allen Hamilton is seeking a Senior Site Reliability Engineer to enhance system resilience and efficiency. The role involves developing robust systems for civil and defense agencies, implementing monitoring tools, and automating operational tasks.

Responsibilities:

Work with civil and defense agencies to develop more robust systems by building resilient infrastructure
Build in redundancy, implement monitoring tools, and automate wherever possible
Reduce toil by scripting routine tasks and automating self-repair
Leverage expertise in automating application resiliency and measuring latency and availability across a wide range of applications, while assisting junior engineers and broadening your knowledge base

Requirements:

5+ years of experience measuring service SLIs using custom metrics, logs, and traces with tools such as Prometheus, Grafana, or OpenTelemetry
5+ years of experience developing Infrastructure as Code (IaC) in Terraform
5+ years of experience automating operational tasks and identifying and reducing toil
5+ years of experience scripting or coding in Python, Go, or Bash
5+ years of experience designing SLIs, SLOs, and error budgets
5+ years of experience deploying code via CI/CD pipelines using GitLab or GitHub
Experience working in cloud platforms such as AWS, Azure, or GCP
Ability to coach, assist, and serve as the SRE Champion for product teams and support the operations and maintenance of applications and services
Ability to obtain a Secret clearance
HS diploma or GED
Experience implementing AIOps
Experience integrating with ServiceNow
Experience implementing self-healing solutions
Experience with application programming interfaces (APIs) and applying advanced SRE practices
Knowledge of chaos engineering and resilience testing
Ability to work in an Agile environment and produce operational runbooks and playbooks
Ability to pay strict attention to detail
Cloud Certification
