Upstart is a leading AI lending marketplace focused on reducing the cost and complexity of borrowing for all Americans. The Principal Software Engineer on the Site Reliability Engineering team will lead the adoption of SRE principles, drive improvements in system reliability and observability, and collaborate with engineering teams across the company to raise the bar for operational excellence.

Responsibilities:

Lead the definition, advocacy, and adoption of SRE principles across engineering teams
Partner with leadership to shape long-term reliability, resiliency, and observability strategies
Champion distributed tracing, real user monitoring (RUM), and key performance metrics such as Largest Contentful Paint (LCP) to improve system visibility and user experience
Build and scale self-healing systems to minimize manual intervention and reduce downtime
Drive enterprise-wide improvements to incident response processes, including those related to Machine Learning systems
Collaborate closely with Development Productivity and Quality teams to improve engineering velocity without sacrificing reliability
Influence technical and operational roadmaps through data-driven insights and hands-on technical contributions
Own and deliver cross-functional initiatives from concept through execution, applying program management skills to align stakeholders and achieve results

Requirements:

10+ years of combined experience across Software Engineering and Site Reliability Engineering, with a balanced background in both disciplines
Proven track record as an SRE thought leader and evangelist, driving adoption of reliability best practices across organizations
Strong communication and mentoring skills to influence engineers across disciplines
Proficiency in Python, Go, and JavaScript/TypeScript
Proficiency with Infrastructure as Code (Terraform, CDK, CloudFormation, etc.)
Experience building internal tooling from scratch in agile development environments
Expertise with observability, distributed tracing, RUM, LCP, and performance monitoring tools (e.g., Datadog, Prometheus)
Experience with on-call and incident management, including large-scale or ML-related incidents
Strong background in automation and building self-healing systems
Hands-on experience applying LLM/GenAI tools to improve SRE efficiency and processes
Program management skills, including the ability to propose innovative solutions, influence leadership, improve processes, and drive cross-functional projects to completion
Experience with service mesh
Full stack development skills
Experience building or extending observability platforms
Background in Development Productivity or Quality Platforms
Experience in high-scale SaaS, microservice-oriented cloud environments
