Realtor.com is a leading online real estate platform that connects buyers, sellers, and renters with expert guidance. They are seeking a Senior Site Reliability Engineer to enhance the reliability and operational excellence of their platform infrastructure, contributing to critical systems and implementing best practices for observability and cost optimization.
Responsibilities:
- Implement and maintain highly available AWS infrastructure, including EKS clusters, ECS on Fargate, and multi-region architectures
- Support reliability of critical services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and supporting infrastructure
- Monitor SLIs, SLOs, and error budgets for Tier 1/2/3 systems; participate in architectural reviews for reliability and cost-efficiency
- Implement reliability patterns including circuit breakers, graceful degradation, and automated failover
- Implement observability solutions using New Relic for APM, distributed tracing, metrics, and logging to support rapid troubleshooting
- Build dashboards and alerts that reduce MTTD and MTTR; contribute to observability standards across teams
- Identify infrastructure cost optimization opportunities and implement FinOps practices including rightsizing and resource lifecycle management
- Support cost-conscious architecture decisions and CI/CD spend optimization (CircleCI, Argo CD)
- Execute chaos engineering experiments to identify system weaknesses; contribute to frameworks for safe production testing
- Participate in game day exercises and disaster recovery simulations; create runbooks and automation for resilience
- Participate in on-call rotation for critical systems; conduct post-incident reviews and implement improvements
- Support incident response processes and contribute to the System Health Scorecard
- Serve as a strong technical individual contributor on the Operations Excellence team
- Collaborate with Platform Engineering, Quality Engineering, and product teams on reliability initiatives
- Support security initiatives including AWS Secrets Manager migration and compliance requirements (SOC 2, PCI, GDPR)
- Contribute to Developer Experience metrics and platform adoption goals
- May provide technical guidance to junior team members
Requirements:
- 5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering with demonstrated success improving system reliability
- Bachelor's degree or equivalent experience
- 3+ years hands-on experience with AWS (EKS, EC2, RDS, S3, CloudWatch, IAM) and Kubernetes including cluster management
- Proficiency in programming (Python, Go, or Java), with infrastructure automation and Infrastructure as Code experience (Terraform, CloudFormation)
- Production experience with observability tools (New Relic, Datadog, Prometheus, Grafana, Splunk) and distributed systems
- Experience with CI/CD platforms and GitOps workflows (CircleCI, Argo CD, Jenkins), as well as on-call rotations and incident response
- Strong communication skills with the ability to explain technical concepts to diverse audiences
- Collaborative approach to working across engineering, product, and business teams
- Self-motivated, with the ability to solve complex problems within established practices and policies
- Data-driven decision making with a customer-centric approach and empathy for developer experience
Preferred:
- Exposure to chaos engineering tools
- API gateway technologies (Tyk/Kong)
- GraphQL federation (Apollo)
- Cost optimization initiatives
- FinOps principles