Redfin is revolutionizing the real estate industry by utilizing data and innovative design. The Staff Infrastructure Reliability Engineer will provide technical leadership for Redfin’s production database and storage systems, collaborating with various teams to enhance system reliability and maintainability.
Responsibilities:
- You will help lead the database and storage strategy at Redfin, including architecture, management, and access patterns
- You will lead complex technical discussions with a variety of audiences, including software and systems engineers and business leaders
- You will architect and lead the implementation of cloud database and storage systems with a focus on reliability, observability, scalability, and security
- You will support large-scale, high-volume databases, both self-managed and on specialized AWS managed offerings, including management activities such as upgrades, backups, recovery, and migrations
- You will use and evangelize approved AI code generation tools to document, architect, and create code
- You will plan and participate in high-availability and disaster recovery drills
- You will lead incident resolution, including performing root cause analyses
- You will use your systems knowledge to promote scaling and performance for services across Redfin and some partner companies
- You will participate in an on-call rotation for about one week per month
Requirements:
- 7+ years of experience managing systems in AWS or a similar cloud environment, including compute and storage, with an emphasis on solution development and execution
- 5+ years of experience with at least one, but preferably more, of the following: PostgreSQL or similar RDBMS; AWS Aurora/RDS; AWS S3; ElastiCache; OpenSearch; DynamoDB
- Proven history of architecting, building, scaling, and supporting cloud infrastructure technologies, specializing in database and storage services, with the ability to communicate the direct business impact of this work
- Extensive experience with Linux administration and Linux scripting, including Python script development
- Experienced mentor of other engineers with the ability to guide a team of engineers to identify and implement solutions to difficult problems
- Committed to best practices that set your team up for long-term success, including infrastructure as code, configuration management tooling, and security practices
- Deep knowledge and professional use of at least one AI code generation tool, such as Anthropic Claude Code, GitHub Copilot, Cursor, or similar, to implement key efficiencies for cloud infrastructure
- Excellent communication skills that allow you to connect with and influence everyone from your immediate team up through senior leadership
- Understanding of core reliability principles, including monitoring, alerting, and incident management, and the ability to implement them