Maintain and improve the availability, performance, and reliability of production and non-production environments.
Proactively identify scalability and capacity risks and recommend mitigation strategies as platform demands grow.
Enhance system observability through monitoring, logging, and alerting, and help define reliability metrics as systems scale.
Lead incident investigations and drive root cause analysis, ensuring systemic improvements are implemented.
Shape and evolve reliability standards and practices while remaining directly engaged in hands-on system improvements.
Build, own, and continuously improve CI/CD pipelines to support reliable, repeatable deployments.
Drive automation of infrastructure provisioning, configuration, and operational workflows to reduce manual effort and operational risk.
Develop and implement tooling that improves system performance, observability, and deployment confidence.
Partner with software engineers to standardize and improve deployment practices, release processes, and operational readiness across services.
Establish and enforce best practices for access controls, secrets management, and system hardening.
Ensure backup, recovery, and disaster-readiness strategies are tested and reliable.
Partner with engineering leadership on security reviews and compliance-related initiatives.
Proactively identify and mitigate infrastructure and operational risks.

7+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or a related role.
Strong troubleshooting skills, with experience leading incident response and driving systemic remediation in production environments.
Strong experience scaling and operating cloud-based production systems (AWS, GCP, or Azure).
Experience designing and maintaining CI/CD pipelines and deployment automation.
Experience with monitoring, logging, and alerting systems for reliability and performance.
Strong understanding of cloud security fundamentals, including access controls, secrets management, and backup strategies.
Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash).
Working knowledge of infrastructure-as-code tools (e.g., Terraform, CloudFormation) and containerization/orchestration technologies (e.g., Docker, Kubernetes).
Strong written and verbal communication skills and experience collaborating with cross-functional teams.
Ability to work on-site at least three days per week (approximately 60%) in our Franklin, TN office.

Senior Site Reliability Engineer

Key skills