Niche is the leader in school search, dedicated to making the process of researching and enrolling in schools easy, transparent, and free. They are looking for a Senior Site Reliability Engineer to take ownership of reliability outcomes for critical services, lead incident response efforts, and mentor team members while driving improvements across their platform.
Responsibilities:
- Own and architect cloud infrastructure across AWS and GCP, including EC2, EKS/Kubernetes, RDS, ElastiCache, S3, and networking components (VPCs, load balancers, DNS), driving improvements that increase reliability and reduce operational burden
- Lead the design and implementation of secrets management strategies using HashiCorp Vault and other tools, establishing organizational standards for secure configuration management
- Architect and evolve infrastructure-as-code practices using Terraform, driving adoption of patterns that improve consistency and reduce deployment risk
- Design and optimize deployment pipelines and CI/CD systems, troubleshoot complex deployment failures with Git and FluxCD, and establish best practices for safe, reliable releases
- Support database operations including migrations and performance tuning
- Own Kafka clusters and message queue systems, including architecture decisions, capacity planning, and troubleshooting complex processing issues
- Participate in 24/7 on-call rotations, responding to alerts, triaging incidents, and coordinating with development teams to resolve production issues
- Design and implement monitoring, alerting, and observability strategies using Prometheus, Grafana, Sumo Logic, and related tools, establishing organizational standards that catch issues before customers notice them
- Define and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, balancing business needs with engineering resources
- Lead blameless post-mortems, write comprehensive incident analyses that teach others, and drive systemic improvements that prevent entire classes of incidents
- Champion access controls, IAM policies, and security configurations across cloud environments, ensuring infrastructure meets compliance and security requirements
- Identify and eliminate systemic sources of operational toil by designing automation, building self-service tooling, and improving developer workflows that scale the team's impact
- Lead AI-assisted automation initiatives to streamline SRE processes, implementing solutions that reduce toil and improve incident response
- Partner with product development teams as the reliability subject matter expert, providing architecture guidance, production readiness reviews, and proactive capacity planning
- Mentor and coach SRE team members, helping them develop technical skills and operational judgment through pairing, code review, and incident response shadowing
- Lead knowledge sharing initiatives, demos, and cross-team collaboration to elevate reliability culture and operational excellence across the engineering organization
- Learn about Niche by meeting with team members across the company in our onboarding meetings
- Shadow SRE team members to learn about our tech stack (AWS, GCP, Kubernetes, Terraform, Vault), the products we support, and our development standards
- Gain access to production systems, observability tools, and documentation
- Begin contributing to bug fixes, documentation improvements, and small infrastructure tasks for initial exposure and impact
- Gain familiarity with our platform's underlying application stacks, deployment processes, and software development lifecycle
- Collaborate with SRE team members to implement new features and improvements within our infrastructure
- Participate in code reviews and provide constructive feedback on infrastructure changes
- Build the skills and knowledge to help analyze and resolve production issues, and begin participating in on-call rotations
- Begin partnering with product development teams to provide platform reliability guidance
- Continue gaining exposure to critical subsystems including databases, data automation, Kafka, task orchestration, and observability platforms
- Be an advocate for reliability standards within your product team partnerships, empowering developers to move faster with confidence
- Contribute to automation initiatives that reduce operational toil and improve team efficiency
- Support compliance efforts by maintaining security controls and contributing to audit evidence collection
- Confidently troubleshoot and resolve complex production issues across our distributed systems
- Identify areas for improvement in our infrastructure, research best practices, and make recommendations to the team
- Use your growing knowledge of our applications to help developers implement changes that increase reliability
- Contribute to defining SLIs and SLOs for services and help establish observability coverage
- Practice and help define what it means to be an SRE at Niche
Requirements:
- 5+ years of experience with cloud platforms (AWS or GCP) and container orchestration systems (Kubernetes/Docker)
- Experience with cloud networking concepts and services including VPCs, subnets, security groups, NAT gateways, VPC peering, load balancers, and DNS management (Route 53, Cloud DNS)
- Strong programming skills in one or more languages (Python, Go, Bash) with demonstrated ability to build automation and tooling
- Advanced experience with Infrastructure as Code tools (Terraform, Helm, Ansible) including module design and organizational standards
- Deep understanding of Linux systems administration and networking fundamentals (TCP/IP, DNS, load balancing, distributed systems)
- Experience with SQL databases (PostgreSQL, MySQL, or SQL Server) including performance tuning and capacity planning
- Experience designing and operating CI/CD pipelines for reliable software delivery
- Track record of leading incident response and driving complex issues to resolution
- Demonstrated ability to mentor engineers and contribute to team technical growth
- Excellent collaboration and communication skills, with the ability to influence technical decisions across teams
- Experience designing and implementing observability strategies using Prometheus, Grafana, Datadog, Sumo Logic, or similar platforms
- Deep understanding of SRE principles including SLIs, SLOs, error budgets, toil reduction, and reliability engineering practices
- Experience operating message queue systems (Kafka, RabbitMQ, or similar) at scale
- Experience with secrets management tools (HashiCorp Vault, AWS Secrets Manager) including design of organizational policies
- Experience with cloud systems infrastructure design, capacity planning, and cost optimization
- Interest in leveraging AI and automation tooling (such as MCP servers, agentic workflows, or LLM-assisted operations) to streamline SRE responsibilities
- Bachelor's degree in Computer Science, a related field, or equivalent experience