AbsenceSoft is transforming the employee experience with secure technology for HR professionals. They are seeking a senior Site Reliability Engineer to manage the reliability, scalability, and security of their AWS production infrastructure for a B2B SaaS platform, while collaborating with cross-functional teams and mentoring junior engineers.

Responsibilities:

Architect, implement, and operate scalable, resilient, and secure AWS infrastructure — including GuardDuty, Lambda, EventBridge, SNS, SES, S3, ALB, and ECS container workloads
Lead infrastructure-as-code initiatives to ensure all environments are reproducible, auditable, and consistently configured in support of SOC 2 change management controls
Design, maintain, and improve CI/CD pipelines using Jenkins and GitHub to enable reliable, repeatable software delivery — partnering with application engineering to reduce release risk and increase deployment frequency
Own the Datadog observability platform, including dashboards, monitors, alerting thresholds, and log management; define and maintain SLOs, SLIs, and error budgets to guide reliability investment and reduce alert fatigue
Serve as a senior technical responder across the full incident lifecycle — detection, containment, resolution, and postmortem — within a shared on-call rotation, and lead blameless postmortems to drive down incident frequency and MTTR
Refine, implement, and test disaster recovery plans to meet RTO/RPO objectives, while contributing to SOC 2 audit readiness with a focus on access controls, incident response, and risk mitigation
Mentor junior SREs through code reviews, incident pairing, and documentation of runbooks and engineering standards

Requirements:

5+ years of experience in SRE, DevOps, or a related engineering role, with advanced hands-on expertise in AWS production environments and core services including Lambda, ECS, S3, ALB, and GuardDuty
Strong proficiency in infrastructure-as-code tooling such as Terraform, CloudFormation, or CDK, paired with experience building and operating CI/CD pipelines using Jenkins and GitHub
Proficiency in Python, Go, or Bash for automation, alongside hands-on experience with Datadog or a comparable observability platform for monitoring, alerting, and log management
Demonstrated experience leading incident response in complex, distributed systems, with working knowledge of SLO/SLI frameworks, error budgets, and disaster recovery planning against defined RTO/RPO objectives
Familiarity with SOC 2 compliance frameworks and experience contributing to audit readiness, access controls, and security control evidence collection
A collaborative, ownership-driven mindset with strong communication skills, a passion for mentoring junior engineers, and a commitment to reducing toil through automation and AI-assisted tooling

Site Reliability Engineer III
