AbsenceSoft is transforming the employee experience with secure technology for HR professionals. The company is seeking a Senior Site Reliability Engineer to manage the reliability, scalability, and security of the AWS production infrastructure behind its B2B SaaS platform, while collaborating with cross-functional teams and mentoring junior engineers.
Responsibilities:
- Architect, implement, and operate scalable, resilient, and secure AWS infrastructure, including GuardDuty, Lambda, EventBridge, SNS, SES, S3, ALB, and ECS container workloads
- Lead infrastructure-as-code initiatives to ensure all environments are reproducible, auditable, and consistently configured in support of SOC 2 change management controls
- Design, maintain, and improve CI/CD pipelines using Jenkins and GitHub to enable reliable, repeatable software delivery, partnering with application engineering to reduce release risk and increase deployment frequency
- Own the Datadog observability platform, including dashboards, monitors, alerting thresholds, and log management; define and maintain SLOs, SLIs, and error budgets to guide reliability investment and reduce alert fatigue
- Serve as a senior technical responder across the full incident lifecycle (detection, containment, resolution, and postmortem) within a shared on-call rotation, and lead blameless postmortems to reduce incident frequency and MTTR
- Refine, implement, and test disaster recovery plans to meet RTO/RPO objectives, while contributing to SOC 2 audit readiness with a focus on access controls, incident response, and risk mitigation
- Mentor junior SREs through code reviews, incident pairing, and documentation of runbooks and engineering standards
Requirements:
- 5+ years of experience in SRE, DevOps, or a related engineering role, with advanced hands-on expertise in AWS production environments and core services including Lambda, ECS, S3, ALB, and GuardDuty
- Strong proficiency in infrastructure-as-code tooling such as Terraform, CloudFormation, or CDK, paired with experience building and operating CI/CD pipelines using Jenkins and GitHub
- Proficiency in Python, Go, or Bash for automation, alongside hands-on experience with Datadog or a comparable observability platform for monitoring, alerting, and log management
- Demonstrated experience leading incident response in complex, distributed systems, with working knowledge of SLO/SLI frameworks, error budgets, and disaster recovery planning against defined RTO/RPO objectives
- Familiarity with the SOC 2 compliance framework and experience contributing to audit readiness, access controls, and security control evidence collection
- A collaborative, ownership-driven mindset with strong communication skills, a passion for mentoring junior engineers, and a commitment to reducing toil through automation and AI-assisted tooling