ActBlue is a nonprofit organization dedicated to creating technology that fuels Democratic victories and enables progressive causes to thrive. ActBlue is seeking a Staff Software Engineer to serve as a technical leader on the SRE team, focusing on reliability initiatives, observability strategy, and incident response capabilities across its platform.
Responsibilities:
- Own and drive SRE technical strategy in key domains: observability, incident management, reliability engineering, and platform operations
- Serve as a go-to consultant for infrastructure and reliability concerns across engineering teams
- Lead architecture decisions for monitoring, alerting, and SLO frameworks; contribute to org-wide RFCs
- Provide L2 on-call support for the most complex and urgent incidents; actively build incident response capability across teams
- Draw on past incidents and post-mortems to drive systemic prevention, not just remediation
- Lead multi-quarter SRE initiatives with complex cross-team dependencies (e.g., observability buildout, on-call training program, service ownership registry)
- Remove blockers for teammates; move stalled projects forward
- Define and maintain SLIs/SLOs for tier-1 business flows: contributions, disbursements, compliance reporting
- Contribute to ActBlue's multi-year reliability roadmap; anticipate how upstream team decisions affect SRE goals
- Prefer automation over manual processes; reduce toil through tooling and systemic fixes
- Work across team and org lines to build buy-in for reliability investments
- Communicate technical strategy and its business value to non-technical stakeholders
- Document and evangelize SRE practices through Architecture Council, team wikis, and cross-team forums
- Mentor engineers at all levels on observability, incident response, and reliability principles
- Provide high-quality feedback on technical proposals that have reliability implications
- Set the technical standard for the SRE team; model the practices you want the team to adopt
- Contribute to a culture of psychological safety and blameless post-mortems
- Help build an on-call program that is sustainable, well-supported, and continuously improving
Requirements:
- 8+ years of experience in SRE, DevOps, or systems/infrastructure engineering
- Deep expertise in observability tooling (we use Datadog: APM, RUM, DBM, dashboards, SLOs, alerting)
- Strong command of Kubernetes and cloud-native infrastructure (we run EKS via Flux on AWS)
- Experience defining and operating SLIs and SLOs in production environments
- Demonstrated ability to lead cross-functional reliability initiatives and build organizational buy-in
- Strong incident management experience: on-call, post-mortems, blameless culture
- Experience with CDN and edge infrastructure (we use Fastly with VCL)
- Familiarity with Rails monolith operations at scale
- Experience with FinOps / cloud cost accountability
- Exposure to PagerDuty, Jeli, or similar incident tooling