System Automation Corporation is hiring a Senior Site Reliability Engineer to join its platform team and help evolve the infrastructure, observability, and security posture of its Azure-based SaaS platform. The role involves collaborating with product engineers and technical leadership to build secure, scalable, and maintainable systems while fostering a strong DevOps culture.
Responsibilities:
- Design and evolve Azure platform infrastructure with a focus on scalability, reliability, and growth readiness
- Participate in capacity planning to support growth, peak demand, and seasonal usage patterns
- Work with development teams to implement infrastructure as code (e.g., Bicep)
- Troubleshoot production infrastructure issues and lead incident response efforts, including coordination, escalation, and real-time remediation across teams
- Conduct post-incident reviews (postmortems) focused on root cause analysis, corrective actions, and long-term reliability improvements rather than blame
- Monitor and operate production systems using Azure Monitor, Application Insights, Sentinel, and related observability tooling
- Improve system reliability and performance through alerting, error monitoring, SLOs/SLAs, and analysis of performance and capacity trends
- Collaborate with the security analyst to define and implement security controls across Azure resources and pipelines
- Manage secrets, certificates, and identity integrations
- Automate security posture checks in CI/CD pipelines
- Maintain policy-as-code using Azure Blueprints or Defender for Cloud
- Act as a key team member in the authorization and enablement of access to secured resources
- Support SOC 2 Type II compliance through tooling, automation, and audit readiness
- Respond to evidence requests and generate reports from observability and security systems
- Contribute to the documentation of platform controls and best practices
- Support, maintain, and own CI/CD pipelines (GitHub Actions, Azure DevOps, or equivalent)
- Optimize build, test, and release flows, partnering with engineers to diagnose failures and improve deployment reliability
- Define and maintain consistent environment standards across development, staging, and production to ensure deployment safety, reliability, and compliance
- Partner with engineering teams to improve deployment promotion strategies, rollback mechanisms, and release safety practices
Requirements:
- 5+ years of experience in Site Reliability, DevOps, or Cloud Infrastructure
- Strong experience with Microsoft Azure, including identity, networking, and monitoring; specifically, demonstrated experience using and optimizing Azure platform-as-a-service (PaaS) technologies, with an understanding of their consumption limitations
- Hands-on experience in DevOps and SRE
- Familiarity with SOC 2 or other compliance frameworks (HIPAA or FedRAMP a plus), and with how they are implemented and maintained in Azure
- Proficiency in scripting or automation (e.g., PowerShell, Bash, Python)
- Strong collaboration and documentation habits
- Ability to quickly identify and create necessary Azure resources/scripts in support of ongoing operations needs
- Experience optimizing infrastructure for cost management
- Experience with Terraform or Bicep and with GitHub Actions
- Exposure to low-code or microservice platforms
- Demonstrated experience using AI tools to optimize work output
- Azure certifications (e.g., AZ-104, AZ-400, AZ-500) are a plus