Attentive is an AI marketing platform specializing in 1:1 personalization that transforms how brands connect with their customers. The Senior Site Reliability Engineer will design and implement systems that improve reliability and incident management, collaborating with AI/ML, Data, Platform, and Product teams to develop best-in-class platforms and services.
Responsibilities:
- Design and deliver high-impact solutions: Design and implement systems that enhance reliability, observability, traceability, and incident management, ensuring the platform scales effectively
- Lead execution on key projects: Take ownership of projects, driving them from discovery through execution
- Partner across teams: Collaborate with engineers from AI/ML, Data, Platform, and Product teams to develop best-in-class platforms and services
- Establish standards and best practices: Define and enforce production standards, processes, and tools to ensure operational excellence
- Champion reliability goals: Advocate for and implement SLIs, SLOs, and other reliability-focused metrics across the engineering organization
- Mentor and share knowledge: Guide and mentor junior team members, fostering technical growth and helping to develop the next generation of engineering leaders
- Innovate and inspire: Drive continuous improvement by bringing creative ideas and challenging the status quo
Requirements:
- 5+ years of experience in Production Engineering, SRE, Platform Engineering, DevOps, Backend Engineering, or similar roles
- Strong coding ability in at least one language (e.g., Golang, Python, Java, TypeScript) with the capability to solve complex issues through code
- Experience with cloud-native technologies and Infrastructure-as-Code (e.g., Kubernetes, Terraform, AWS)
- Demonstrated experience delivering medium- to large-scale projects that drive meaningful improvements in platform reliability and scalability
- Deep understanding of production reliability concepts, including SLIs, SLOs, and incident management
- Proficiency in designing and maintaining CI/CD pipelines, deployment strategies, and release automation to enable fast, safe delivery
- Fluency in frontier AI-assisted development tools and agents (Claude Code, Codex, Cursor, or similar)
- Excellent verbal and written communication skills with the ability to collaborate across technical and non-technical teams
- Familiarity with working in dynamic, reliability-focused production environments