Serve as the primary point of contact for several critical production SaaS applications hosted in Azure, ensuring their availability, performance, and reliability.
Maintain and support infrastructure within a FedRAMP High authorized environment, ensuring continuous compliance with NIST 800-53 controls and participating in audit readiness activities.
Configure, monitor, troubleshoot, and resolve complex cloud infrastructure and application issues across multiple environments.
Ensure critical SLAs are met, which includes participating in an on-call rotation covering weekends and emergencies.
Develop and maintain automation solutions for monitoring, alert mitigation, telemetry, log analysis, and incident response.
Contribute to security documentation, including system security plans, standard operating procedures, and runbooks.
Apply observability best practices to proactively detect and mitigate issues using logging, metrics, tracing, and alerting tools.
Partner with engineering, security, and product teams to drive reliability improvements and ensure services are built with SRE principles from the ground up.
Lead and contribute to post-incident reviews, identifying root causes and implementing preventive actions.
Requirements
8+ years of relevant experience in Site Reliability Engineering, DevOps, or Cloud Administration.
Strong background in integrating, upgrading, securing, and supporting software systems across heterogeneous environments.
Proven hands-on experience as a Cloud Administrator with Azure, including microservices on AKS (Azure Kubernetes Service), cloud concepts, and cloud security.
Scripting and programming experience with PowerShell and Python, plus familiarity with data and markup formats such as XML, JSON, and YAML.
Infrastructure-as-code expertise with Terraform and Azure DevOps pipelines.
Knowledge of redundancy, backup, and disaster recovery strategies in cloud environments.
Hands-on expertise with monitoring and observability tools such as Datadog, Azure Application Insights, and Log Analytics.
Strong understanding of networking fundamentals, including firewalls, VLANs, NAT, NACLs, load balancing, VPN tunnels, DNS, DHCP, and packet filtering.
Direct experience operating in FedRAMP environments, with working knowledge of NIST 800-53 controls, continuous monitoring (ConMon) requirements, and boundary protection.