Realtor.com is a leading online real estate platform that connects buyers, sellers, and renters with expert guidance. The company is seeking a Senior Site Reliability Engineer to improve the reliability and operational excellence of its platform infrastructure by supporting critical systems and implementing best practices for observability and cost optimization.

Responsibilities:

Implement and maintain highly available AWS infrastructure including EKS clusters, Fargate (ECS), and multi-region architectures
Support reliability of critical services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and supporting infrastructure
Monitor SLIs, SLOs, and error budgets for Tier 1/2/3 systems; participate in architectural reviews for reliability and cost-efficiency
Implement reliability patterns including circuit breakers, graceful degradation, and automated failover
Implement observability solutions using New Relic for APM, distributed tracing, metrics, and logging to enable rapid troubleshooting
Build dashboards and alerts that reduce MTTD and MTTR; contribute to observability standards across teams
Identify infrastructure cost optimization opportunities and implement FinOps practices including rightsizing and resource lifecycle management
Support cost-conscious architecture decisions and CI/CD spend optimization (CircleCI, Argo CD)
Execute chaos engineering experiments to identify system weaknesses; contribute to frameworks for safe production testing
Participate in game day exercises and disaster recovery simulations; create runbooks and automation for resilience
Participate in on-call rotation for critical systems; conduct post-incident reviews and implement improvements
Support incident response processes and contribute to the System Health Scorecard
Serve as a strong technical individual contributor on the Operations Excellence team
Collaborate with Platform Engineering, Quality Engineering, and product teams on reliability initiatives
Support security initiatives including AWS Secrets Manager migration and compliance requirements (SOC 2, PCI, GDPR)
Contribute to Developer Experience metrics and platform adoption goals
May provide technical guidance to junior team members

Requirements:

5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering with demonstrated success improving system reliability
Bachelor's degree or equivalent experience
3+ years hands-on experience with AWS (EKS, EC2, RDS, S3, CloudWatch, IAM) and Kubernetes including cluster management
Proficiency in programming (Python, Go, or Java), with infrastructure automation and Infrastructure as Code experience (Terraform, CloudFormation)
Production experience with observability tools (New Relic, Datadog, Prometheus, Grafana, Splunk) and distributed systems
Experience with CI/CD platforms and GitOps workflows (CircleCI, Argo CD, Jenkins), on-call rotations, and incident response
Strong communication skills with ability to explain technical concepts to diverse audiences
Collaborative approach working across engineering, product, and business teams
Self-motivated with ability to solve complex problems within established practices and policies
Data-driven decision making with customer-centric approach and empathy for developer experience
Exposure to chaos engineering tools
Exposure to API gateway technologies (Tyk/Kong)
Exposure to GraphQL federation (Apollo)
Experience with cost optimization initiatives
Familiarity with FinOps principles
