Docusign is a leading company in e-signature and contract lifecycle management solutions, serving over 1.5 million customers globally. They are seeking a Senior Site Reliability Engineer to lead reliability initiatives, ensuring the performance and scalability of critical systems while driving automation and improvements in incident response and observability.

Responsibilities:

Design, implement, and operate highly available, scalable services in cloud environments (primarily Azure, with some multi‑cloud scenarios)
Define and evolve SLOs/SLIs, error budgets, and capacity strategies for owned services; use them to guide engineering trade‑offs and release decisions
Analyze patterns in incidents and outages; own long‑term reliability improvements for your domain and contribute to reliability strategy across services
Write high-quality code that is easy to maintain and test
Ensure designs and architecture are extensible across projects, and participate in technical design and code reviews
Identify operational toil and lead automation efforts to eliminate it—deployment, runbook, and remediation workflows that make incidents rarer and faster to resolve
Develop robust, well‑tested tooling and shared libraries that are adopted across multiple teams
Improve CI/CD pipelines and guardrails to reduce change failure rate while increasing deployment velocity
Design and implement logging, metrics, tracing, and alerting for complex distributed systems; ensure signals are actionable and aligned to business impact
Build and automate tools and solutions for incident impact analysis and effective mitigation
Participate in and often lead incident response for Sev0–Sev2 events: triage, mitigation, coordination, and clear communication
Perform and contribute to blameless post‑incident reviews, root‑cause analysis, and follow‑through on corrective actions
Work with Operations and Incident Command teams during and after incidents to drive excellence in the incident management process
Build and analyze dashboards to highlight areas of the business that need attention and help drive organizational KPIs
Create and respond to system-generated alerts to maintain system health
Work with Operations and Engineers to fill any gaps in alerting and telemetry
Act as the primary SRE partner for one or more engineering teams—shaping architecture, reviewing designs, and embedding reliability best practices
Mentor and coach other SREs and software engineers on topics such as debugging, observability, incident management, and performance optimization
Contribute to and help standardize SRE practices, runbooks, and production readiness criteria across CPE and product teams
Work with Product Management, collaborators and other developers to understand design requirements and provide estimates for development
Learn and grow in all key technologies at Docusign and be a partner to Engineering and Operations teams

Requirements:

8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with ownership of production systems at scale (or equivalent experience)
Experience coding in at least one modern language (e.g., Go, Python, C#, Java), with the ability to design, implement, test, and debug production‑grade automation and services
Practical experience operating large‑scale services in public cloud (Azure preferred; AWS/GCP acceptable with willingness to learn Azure)
Experience with Linux, networking fundamentals, and common infrastructure components (load balancers, DNS, certificates, queues, caches, databases)
Experience with observability stacks (e.g., Prometheus/Grafana, OpenTelemetry/Chronicle, centralized logging)
Experience with CI/CD systems and deployment strategies (blue/green, canary, rolling updates)
Experience with incident management and on‑call operations for 24x7 services
Experience building dashboards and analyzing metrics
Strong analytical and problem-solving skills
Experience in high‑availability, regulated, or customer‑facing SaaS environments
Background in reliability practices such as chaos testing, capacity modeling, and performance tuning
Exposure to release management/unified release practices and safe rollout strategies (feature flags, staged rollouts, configuration‑driven changes)
Demonstrated leadership driving cross‑team initiatives: reliability programs, migrations, or major refactors
Strong written and verbal communication skills; ability to explain complex technical topics to both engineers and non‑technical stakeholders
