Senior Site Reliability Engineer — Government & Sovereign Cloud

United States of America

Full Time

1 week ago

$152,000 - $252,000 USD

No H1B

Key skills

Site Reliability Engineering, Platform Engineering, Government Cloud (Azure Government, AWS GovCloud), Regulated Compliance Environments (FedRAMP, CMMC, IL2/IL4/IL5, PCI-DSS, SOX, HIPAA, HITRUST), Cloud Infrastructure (Azure preferred), Programming (TypeScript, JavaScript, Go, Java, C#, C), Monitoring, Observability Tools (Prometheus, Grafana, OpenTelemetry, ELK stack), Infrastructure as Code (Terraform, Terragrunt, Pulumi, Azure ARM templates, AWS CloudFormation, Serverless Framework), Container Orchestration (Kubernetes), CI/CD, GitOps Tooling (GitHub Actions, Azure DevOps, GitLab CI, ArgoCD, FluxCD, Dagger), Distributed Systems, Networking, Cloud-native Architecture, Clear written and verbal communication, AI, LLM, AWS, Azure, GitHub, GitLab, GitOps, SaaS

About this role

Veeam Software is the Data and AI Trust Company, specializing in data resilience and security. The Senior Site Reliability Engineer will support the Veeam Data Cloud, focusing on the Government and Sovereign Cloud environment, and will be responsible for ensuring high availability, incident response, and observability across the platform.

Responsibilities:

Get up to speed on the full platform: all Veeam Data Cloud (VDC) workloads, dependencies, and risk areas. Much of this will happen through code, docs, and conversations rather than direct environment access. Work with SMEs across the org to fill knowledge gaps and build onboarding material for the team. Write and maintain runbooks, architecture docs, and operational guides.
Design infrastructure for high availability and fault tolerance on Azure (including Azure Government). Define SLIs, SLOs, and error budgets where none exist today. Run incident response and blameless postmortems, and turn incidents into improvements. Identify reliability risks across modern and legacy workloads and build practical remediation plans that work within compliance constraints.
Close observability gaps by defining instrumentation requirements and driving implementation. Set alerting, telemetry, and monitoring standards with partner teams. Build automation to reduce toil and support fleet management. Participate in on-call rotations.
Work with IaC, CI/CD, deployment automation, and config management, including in air-gapped or compliance-restricted environments. Build and maintain testing, canary deployment, and release validation pipelines. Integrate chaos engineering and monitoring tools, adapting choices to meet regulatory requirements.
Work across product, platform, security, legal, compliance, and operations teams. Own problems end-to-end: identify gaps and drive solutions without waiting for direction. Mentor other engineers and help spread SRE practices across the org.

Requirements:

7+ years in Software Engineering, with 3+ years in SRE, Platform Engineering, or similar — across multi-service platforms, not just single-service environments
Experience with Government or Sovereign Cloud (e.g., Azure Government, AWS GovCloud)
Experience in regulated compliance environments — government (FedRAMP, CMMC, IL2/IL4/IL5), financial (PCI-DSS, SOX), or healthcare (HIPAA, HITRUST)
Strong experience building and running production services on cloud infrastructure (Azure preferred, including Azure Government)
Able to learn large, complex platforms quickly with limited guidance — comfortable building understanding from code, docs, and architecture artifacts when direct environment access is restricted
Can investigate systems independently and produce clear docs, risk assessments, and improvement plans
Comfortable working across teams — engineering, product, security, compliance, operations
Programming skills in one or more of: TypeScript/JS, Go, Java, C#, or similar
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, OpenTelemetry, ELK stack)
Experience with IaC (Terraform, Terragrunt, Pulumi) and container orchestration (Kubernetes)
Experience with CI/CD and GitOps tooling — GitHub Actions, Azure DevOps, GitLab CI, ArgoCD, FluxCD, or Dagger
Solid grasp of distributed systems, networking, and cloud-native architecture
Clear written and verbal communication skills
Experience on B2B SaaS platforms in regulated or government markets
Background in chaos engineering, resilience testing, or performance/load testing
Experience building an SRE or reliability function from scratch
Experience across mixed environments — modern cloud-native and older legacy systems
Familiar with AI-first development workflows — using LLM-powered tools for infrastructure automation, code generation, and documentation