Upstart is a leading AI lending marketplace focused on reducing the cost and complexity of borrowing for Americans. The Senior Software Engineer in Site Reliability will enhance the reliability and observability of Upstart's production systems, implementing standards for monitoring and improving incident response practices.
Responsibilities:
- Embody and share SRE principles at Upstart
- Exercise state-of-the-art SRE practices throughout the company
- Uphold a culture of visibility, ownership, and responsibility around service reliability
- Implement standards for monitoring microservices, web apps, mobile apps, databases, Kubernetes clusters, and machine learning platforms in a fast-paced environment
- Improve incident response practices, both within SRE and throughout the company
- Automate away toil wherever automation makes sense
Requirements:
- Minimum of 6 years of combined experience across Software Engineering, Site Reliability, and/or DevOps Engineering, including CI/CD, TDD, internal tooling, observability, and other agile development practices
- Proficiency coding in Python, Go, JavaScript/TypeScript
- Proficiency with Infrastructure as Code (Terraform, CDK, CloudFormation, etc.)
- Software engineering background, including experience building internal tooling from scratch and applying agile development techniques
- Strong software design & architecture skills
- Fundamentally sound with data structures & algorithms
- Experience with on-call and incident management environments
- Experience with observability, monitoring, and reporting tools (e.g., Datadog, Sumo Logic, etc.)
- Experience supporting SaaS software in a microservice-oriented cloud environment
- Ability to work with multiple teams for enterprise-wide deliverables
- Data/metrics-driven mindset
- Experience with service mesh
- Full Stack development skills
- Experience building tooling for an observability platform
- Experience leveraging LLM/GenAI to improve SRE efficiency and processes