DemandBridge operates mission-critical platforms that support core business and customer-facing systems. The company is seeking a Senior Site Reliability Engineer to ensure the reliability and operational readiness of its cloud platform and Azure-based infrastructure.
Responsibilities:
- Own and operate a production cloud platform running on Microsoft Azure and Cloud Foundry (or comparable platforms)
- Ensure availability, performance, and reliability across infrastructure and platform components
- Serve as the primary escalation point for platform-level incidents
- Lead incident response, root cause analysis, and post-incident remediation
- Use modern monitoring, alerting, and AI-assisted observability tools to improve detection, diagnosis, and resolution of incidents
- Drive continuous improvements to reduce operational risk, after-hours incidents, and manual intervention
- Own certificate and secrets lifecycle management, including TLS automation and secure secrets handling (e.g., CredHub, Vault)
- Ensure secure and compliant practices around identity, access, and credential management
- Partner with engineering teams to embed security and reliability best practices into platform workflows
- Automate common operational tasks using Bash and/or PowerShell
- Support and extend infrastructure-as-code using Terraform and/or Bicep
- Improve platform consistency and repeatability through Git-driven, automation-first workflows
- Leverage AI-assisted tooling to support scripting, troubleshooting, and operational documentation
- Support PCI and other compliance activities, including technical control implementation, audit support, and remediation tracking
- Maintain clear runbooks, diagrams, and documentation to enable repeatable operations and knowledge transfer
- Partner with internal teams and external auditors to support compliance requirements
- Work closely with application engineers, junior SRE/support staff, and vendor partners
- Provide technical guidance and mentorship to junior teammates
- Act as a trusted partner to engineering teams on reliability, performance, and operational readiness
Requirements:
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles supporting production environments
- Hands-on experience with Cloud Foundry, Kubernetes, or Docker in production (Cloud Foundry preferred)
- Strong experience with Microsoft Azure, including networking, compute, IAM, and monitoring
- Strong Linux systems administration experience (RHEL preferred); comfort with Windows Server environments
- Proficiency in PowerShell and/or Bash scripting
- Solid understanding of TLS/PKI workflows, including certificate management and rotation
- Proven experience managing incidents end-to-end and performing root cause analysis
- Strong written communication skills and a disciplined approach to documentation
- Experience using modern automation, observability, or AI-enabled operational tools to improve reliability and efficiency
- Experience with BOSH, CredHub, Vault, or similar infrastructure tooling
- Exposure to PCI or other compliance frameworks and audit cycles
- Familiarity with VPN gateways, DNS management, or email infrastructure (SMTP, SPF/DKIM/DMARC)
- Experience operating in Git-driven, automation-heavy environments