Alpaca is a US-headquartered self-clearing broker-dealer and brokerage infrastructure company. The Staff Site Reliability Engineer, Database is responsible for the reliability, scalability, and performance of systems and services, and collaborates with development and operations teams to maintain robust applications.
Responsibilities:
- Triage difficult technical problems and implement solutions
- Improve our observability stack (monitoring, logging, profiling)
- Incident Management: Respond to and resolve incidents in a timely manner, conducting post-incident reviews to identify and implement improvements
- Collaboration: Work closely with development teams to ensure new features and services are designed with reliability and scalability in mind
- Capacity Planning: Monitor system capacity and performance, making recommendations and implementing changes to handle future growth
Requirements:
- 5+ years of experience in Site Reliability Engineering, Performance Engineering, or similar roles
- 5+ years of experience with multi-terabyte-scale PostgreSQL clusters
- Proven track record of managing and maintaining large-scale, high-availability, high-performance PostgreSQL databases
- Experience designing and implementing SLIs, SLOs, and SLAs for internal systems and databases
- Experience with troubleshooting PostgreSQL performance problems and slow queries
- Extensive experience with efficient schema and query design
- Experience migrating multi-terabyte tables into more efficient schemas
- Proficient with Go
- Proficient with Prometheus
- Proficient with Linux
- Knowledgeable in trading/fintech domains
- Experience with low-latency systems
- Experience with distributed tracing
- Experience scaling PostgreSQL clusters rapidly
- Experience with pgx, gorm, or sqlc