Veeam Software is the Data and AI Trust Company, specializing in data resilience and security. The company is seeking a Senior Site Reliability Engineer to build a global SRE function for its Government and Sovereign Cloud environment, with a focus on high availability, incident response, and cross-team collaboration.
Responsibilities:
- Get up to speed on the full platform — all VDC workloads, dependencies, and risk areas. Much of this will happen through code, docs, and conversations rather than direct environment access
- Work with SMEs across the org to fill knowledge gaps and build onboarding material for the team
- Write and maintain runbooks, architecture docs, and operational guides
- Design infrastructure for high availability and fault tolerance on Azure (including Azure Government)
- Define SLIs, SLOs, and error budgets where none exist today
- Run incident response and blameless postmortems. Turn incidents into improvements
- Identify reliability risks across modern and legacy workloads and build practical remediation plans that work within compliance constraints
- Close observability gaps — define instrumentation requirements and drive implementation
- Set alerting, telemetry, and monitoring standards with partner teams
- Build automation to reduce toil and support fleet management
- Participate in on-call rotations
- Work with IaC, CI/CD, deployment automation, and config management — including in air-gapped or compliance-restricted environments
- Build and maintain testing, canary deployment, and release validation pipelines
- Integrate chaos engineering and monitoring tools, adapting choices to meet regulatory requirements
- Work across product, platform, security, legal, compliance, and operations teams
- Own problems end-to-end — identify gaps, drive solutions, don't wait for direction
- Mentor other engineers and help spread SRE practices across the org
Requirements:
- 7+ years in Software Engineering, with 3+ years in SRE, Platform Engineering, or similar — across multi-service platforms, not just single-service environments
- Experience with Government or Sovereign Cloud (e.g., Azure Government, AWS GovCloud)
- Experience in regulated compliance environments — government (FedRAMP, CMMC, IL2/IL4/IL5), financial (PCI-DSS, SOX), or healthcare (HIPAA, HITRUST). You understand how compliance shapes architecture and operations
- Strong experience building and running production services on cloud infrastructure (Azure preferred, including Azure Government)
- Ability to learn large, complex platforms quickly with limited guidance, building understanding from code, docs, and architecture artifacts when direct environment access is restricted
- Ability to investigate systems independently and produce clear docs, risk assessments, and improvement plans
- Comfortable working across teams — engineering, product, security, compliance, operations
- Programming skills in one or more of: TypeScript/JS, Go, Java, C#, or similar
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, OpenTelemetry, ELK stack)
- Experience with IaC (Terraform, Terragrunt, Pulumi) and container orchestration (Kubernetes)
- Experience with CI/CD and GitOps tooling — GitHub Actions, Azure DevOps, GitLab CI, ArgoCD, FluxCD, or Dagger
- Solid grasp of distributed systems, networking, and cloud-native architecture
- Clear written and verbal communication skills
- Experience on B2B SaaS platforms in regulated or government markets
- Background in chaos engineering, resilience testing, or performance/load testing
- Experience building an SRE or reliability function from scratch
- Experience across mixed environments — modern cloud-native and older legacy systems
- Familiarity with AI-first development workflows, including LLM-powered tools for infrastructure automation, code generation, and documentation