Docusign is a leading company in e-signature and contract lifecycle management solutions, serving over 1.5 million customers globally. They are seeking a Senior Site Reliability Engineer to lead reliability initiatives, ensuring the performance and scalability of critical systems while driving automation and improvements in incident response and observability.
Responsibilities:
- Design, implement, and operate highly available, scalable services in cloud environments (primarily Azure, with some multi‑cloud scenarios)
- Define and evolve SLOs/SLIs, error budgets, and capacity strategies for owned services; use them to guide engineering trade‑offs and release decisions
- Analyze patterns in incidents and outages; own long‑term reliability improvements for your domain and contribute to reliability strategy across services
- Write high-quality code that is easy to maintain and test
- Ensure design and architecture are extensible across projects, and participate in technical design and code reviews
- Identify operational toil and lead automation efforts to eliminate it—deployment, runbook, and remediation workflows that make incidents rarer and faster to resolve
- Develop robust, well‑tested tooling and shared libraries that are adopted across multiple teams
- Improve CI/CD pipelines and guardrails to reduce change failure rate while increasing deployment velocity
- Design and implement logging, metrics, tracing, and alerting for complex distributed systems; ensure signals are actionable and aligned to business impact
- Build and automate tools and solutions for incident impact analysis and effective mitigation
- Participate in and often lead incident response for Sev0–Sev2 events: triage, mitigation, coordination, and clear communication
- Perform and contribute to blameless post‑incident reviews, root‑cause analysis, and follow‑through on corrective actions
- Work with Operations and Incident Command teams during and after incidents to drive excellence in the incident management process
- Build and analyze dashboards that highlight areas of the business needing attention and help drive organizational KPIs
- Create and respond to system-generated alerts to maintain system health
- Work with Operations and Engineers to fill any gaps in alerting and telemetry
- Act as the primary SRE partner for one or more engineering teams—shaping architecture, reviewing designs, and embedding reliability best practices
- Mentor and coach other SREs and software engineers on topics such as debugging, observability, incident management, and performance optimization
- Contribute to and help standardize SRE practices, runbooks, and production readiness criteria across CPE and product teams
- Work with Product Management, collaborators and other developers to understand design requirements and provide estimates for development
- Learn and grow across the key technologies used at Docusign and be a partner to the Engineering and Operations teams
Requirements:
- 8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with ownership of production systems at scale (or equivalent experience)
- Experience coding in at least one modern language (e.g., Go, Python, C#, Java), with the ability to design, implement, test, and debug production‑grade automation and services
- Practical experience operating large‑scale services in public cloud (Azure preferred; AWS/GCP acceptable with willingness to learn Azure)
- Experience with Linux, networking fundamentals, and common infrastructure components (load balancers, DNS, certificates, queues, caches, databases)
- Experience with observability stacks (e.g., Prometheus/Grafana, OpenTelemetry/Chronicle, centralized logging)
- Experience with CI/CD systems and deployment strategies (blue/green, canary, rolling updates)
- Experience with incident management and on‑call operations for 24x7 services
- Experience building dashboards and analyzing metrics
- Strong analytical and problem-solving skills
- Experience in high‑availability, regulated, or customer‑facing SaaS environments
- Background in reliability practices such as chaos testing, capacity modeling, and performance tuning
- Exposure to release management/unified release practices and safe rollout strategies (feature flags, staged rollouts, configuration‑driven changes)
- Demonstrated leadership driving cross‑team initiatives: reliability programs, migrations, or major refactors
- Strong written and verbal communication skills; ability to explain complex technical topics to both engineers and non‑technical stakeholders