McGraw Hill, a leading provider of digital educational resources and content, is seeking a Lead Site Reliability Engineer to lead a team of six engineers in its Digital Platform Group. The role is responsible for the reliability, scalability, and performance of K–12 learning platforms that serve millions of students and educators nationwide.
Responsibilities:
- Lead a six-member SRE team supporting production infrastructure and services
- Manage backlog, sprint planning, and team velocity
- Own reliability, uptime, security, cost, and performance of services
- Define and monitor SLOs for application workloads
- Plan on-call rotations and work to reduce alert fatigue
- Forecast seasonal growth and plan capacity accordingly
- Mentor engineers and foster professional growth
- Report status and issues to leadership monthly
- Partner with development teams
- Collaborate with CyberSecurity on risk mitigation
- Collaborate with FinOps on cost reduction
- Design and troubleshoot highly distributed, cloud-based production systems
- Maintain infrastructure-as-code and monitoring-as-code practices
- Improve system resiliency through failure injection and chaos testing
- Participate in on-call rotation and resolve operational issues
- Optimize existing systems for performance and cost
- Ensure telemetry provides visibility into application performance
- Support agile development practices and code reviews
Requirements:
- 5+ years of experience in SRE, DevOps, or Software Engineering roles supporting enterprise applications
- Strong problem-solving, triage, and root cause analysis skills with a systems engineering mindset
- Deep expertise in the AWS ecosystem, with hands-on experience across core services, primarily ECS, RDS, EKS, IAM, CloudWatch, and networking configuration
- Expertise with Terraform for managing and automating scalable cloud infrastructure
- Skill in building CI/CD pipelines (e.g., GitHub Actions) and managing the end-to-end software delivery lifecycle
- Strong familiarity with telemetry and observability tools (e.g., New Relic, Datadog), including querying logs and metrics for performance monitoring