DemandBridge operates mission-critical platforms that support core business and customer-facing systems. The company is seeking a Senior Site Reliability Engineer to ensure the reliability and operational readiness of its cloud platform and Azure-based infrastructure.

Responsibilities:

Own and operate a production cloud platform running on Microsoft Azure and Cloud Foundry (or comparable platforms)
Ensure availability, performance, and reliability across infrastructure and platform components
Serve as the primary escalation point for platform-level incidents
Lead incident response, root cause analysis, and post-incident remediation
Use modern monitoring, alerting, and AI-assisted observability tools to improve detection, diagnosis, and resolution of incidents
Drive continuous improvements to reduce operational risk, after-hours incidents, and manual intervention
Own certificate and secrets lifecycle management, including TLS automation and secure secrets handling (e.g., CredHub, Vault)
Ensure secure and compliant practices around identity, access, and credential management
Partner with engineering teams to embed security and reliability best practices into platform workflows
Automate common operational tasks using Bash and/or PowerShell
Support and extend infrastructure-as-code using Terraform and/or Bicep
Improve platform consistency and repeatability through Git-driven, automation-first workflows
Leverage AI-assisted tooling to support scripting, troubleshooting, and operational documentation
Support PCI and other compliance activities, including technical control implementation, audit support, and remediation tracking
Maintain clear runbooks, diagrams, and documentation to enable repeatable operations and knowledge transfer
Partner with internal teams and external auditors to support compliance requirements
Work closely with application engineers, junior SRE/support staff, and vendor partners
Provide technical guidance and mentorship to junior teammates
Act as a trusted partner to engineering teams on reliability, performance, and operational readiness

Requirements:

5+ years of experience in SRE, DevOps, or infrastructure engineering roles supporting production environments
Hands-on experience with Cloud Foundry, Kubernetes, or Docker in production (Cloud Foundry preferred)
Strong experience with Microsoft Azure, including networking, compute, IAM, and monitoring
Strong Linux systems administration experience (RHEL preferred); comfort with Windows Server environments
Proficiency in PowerShell and/or Bash scripting
Solid understanding of TLS/PKI workflows, including certificate management and rotation
Proven experience managing incidents end-to-end and performing root cause analysis
Strong written communication skills and a disciplined approach to documentation
Experience using modern automation, observability, or AI-enabled operational tools to improve reliability and efficiency
Experience with BOSH, CredHub, Vault, or similar infrastructure tooling
Exposure to PCI or other compliance frameworks and audit cycles
Familiarity with VPN gateways, DNS management, or email infrastructure (SMTP, SPF/DKIM/DMARC)
Experience operating in Git-driven, automation-heavy environments
