Sezzle is a company on a mission to financially empower the next generation by revolutionizing the shopping experience. They are seeking a Principal Site Reliability Engineer to architect and build scalable infrastructure solutions while driving reliability and operational excellence across their systems.

Responsibilities:

Architect, upgrade, design, and build scalable infrastructure solutions leveraging Kubernetes, AWS, RDS (MySQL/Postgres), and modern distributed patterns
Help drive the infrastructure team’s roadmap, leading us to higher levels of reliability, recoverability, and scalability
Drive capacity planning and benchmarking, and work with the team to stress-test our systems, find bottlenecks, and prepare for further growth in the business
Define, maintain, and enforce SLAs and alerts across our infrastructure
Lead the teams toward stronger signal anomaly detection and better, more flexible alerting
Help lead Sezzle’s AI enablement efforts, identifying opportunities to apply AI and automation to enhance infrastructure reliability, developer productivity, and internal tooling
Build in consistency and scalability across a distributed microservices architecture while maintaining performance and reliability
Establish and evolve engineering best practices for observability, security, and CI/CD across teams
Mentor engineers and champion a culture of learning, innovation, and operational excellence
Collaborate cross-functionally to translate business goals into technical roadmaps and deliver results that matter

Requirements:

15+ years of professional software engineering or infrastructure engineering experience, including significant SRE and backend experience
Deployed significant changes to a production application or infrastructure configuration in the past 30 days
Expertise with SQL-based RDBMS (MySQL, PostgreSQL) and experience optimizing schema and queries for performance at scale
Proficiency in observability tools (Prometheus, Grafana, Datadog, New Relic)
Solid understanding of distributed systems design patterns (e.g., transactional outbox, event-driven architecture and stream processing, queues)
Demonstrated ability to bring new ideas forward, influence decisions, and lead complex technical initiatives
Bachelor's degree in Computer Science or equivalent practical experience
Experience with AWS cloud infrastructure, mainly AWS Aurora RDS, both MySQL and Postgres
Experience with data engineering, data pipelines, and data warehousing
Experience with CI/CD pipelines and deploying containerized microservices in Kubernetes
Familiarity with AI developer tooling such as Claude Code, Gemini CLI, Codex, and Cursor, and with using it to be a more productive engineer
Strong proficiency in Golang, with experience building and maintaining RESTful APIs
Track record of shipping commercial APIs and data-driven applications in high-growth environments
Proven leadership in guiding technical direction, improving system reliability, and scaling high-traffic services
