ActBlue is a nonprofit organization dedicated to creating technology that fuels Democratic victories and enables progressive causes to thrive. ActBlue is seeking a Staff Software Engineer to serve as a technical leader on the SRE team, focusing on reliability initiatives, observability strategy, and incident response capabilities across its platform.

Responsibilities:

Own and drive SRE technical strategy in key domains: observability, incident management, reliability engineering, and platform operations
Serve as a go-to consultant for infrastructure and reliability concerns across engineering teams
Lead architecture decisions for monitoring, alerting, and SLO frameworks; contribute to org-wide RFCs
Provide L2 on-call support for the most complex and urgent incidents; actively build incident response capability across teams
Draw on past incidents and post-mortems to drive systemic prevention, not just remediation
Lead multi-quarter SRE initiatives with complex cross-team dependencies (e.g., observability buildout, on-call training program, service ownership registry)
Remove blockers for teammates; move stalled projects forward
Define and maintain SLIs/SLOs for tier-1 business flows: contributions, disbursements, compliance reporting
Contribute to ActBlue's multi-year reliability roadmap; anticipate how upstream team decisions affect SRE goals
Prefer automation over manual processes; reduce toil through tooling and systemic fixes
Work across team and org lines to build buy-in for reliability investments
Communicate technical strategy and its business value to non-technical stakeholders
Document and evangelize SRE practices through Architecture Council, team wikis, and cross-team forums
Mentor engineers at all levels on observability, incident response, and reliability principles
Provide high-quality feedback on technical proposals that have reliability implications
Set the technical standard for the SRE team; model the practices you want the team to adopt
Contribute to a culture of psychological safety and blameless post-mortems
Help build an on-call program that is sustainable, well-supported, and continuously improving

Requirements:

8+ years of experience in SRE, DevOps, or systems/infrastructure engineering
Deep expertise in observability tooling (we use Datadog: APM, RUM, DBM, dashboards, SLOs, alerting)
Strong command of Kubernetes and cloud-native infrastructure (we run EKS via Flux on AWS)
Experience defining and operating SLIs and SLOs in production environments
Demonstrated ability to lead cross-functional reliability initiatives and build organizational buy-in
Strong incident management experience: on-call, post-mortems, blameless culture
Experience with CDN and edge infrastructure (we use Fastly with VCL)
Familiarity with Rails monolith operations at scale
Experience with FinOps / cloud cost accountability
Exposure to PagerDuty, Jeli, or similar incident tooling

Staff Software Engineer, Platform
