Sephora is a leading beauty retailer that values diversity and inclusivity. As a Senior Engineer in Site Reliability Engineering, you will ensure stable online experiences for millions of customers by monitoring and optimizing the reliability of Sephora's digital platforms.

Responsibilities:

Ensure Platform Stability. Operate and support the Dotcom and OMNI platforms (including BOPIS and Same-Day Delivery), ensuring high availability, resilience, and hyper-stable customer experiences during normal operations and peak traffic events
Lead Incident Response. Triage, diagnose, and resolve L2/L3 production incidents; lead post-incident reviews and partner with engineering teams on permanent corrective actions to eliminate root causes
Drive Intelligent Automation. Build automation solutions, reduce operational toil, and create AI-driven reliability tools and agentic workflows to shorten mean time to resolution and improve productivity and overall stability
Enhance Observability. Develop and optimize observability through logs, metrics, traces, dashboards, and anomaly detection; refine alerting and telemetry pipelines to proactively identify and resolve issues
Validate Release Readiness. Ensure world-class readiness for releases, seasonal events, feature launches, and traffic spikes through resiliency checks, performance validation, and comprehensive change reviews
Maintain Reliability Standards. Maintain and optimize SLO/SLI frameworks; monitor error budgets and partner with application teams on continuous reliability improvements

Requirements:

6+ years of hands-on SRE, DevOps, or Production Engineering experience in high-scale digital applications, with a strong understanding of reliability principles and operational excellence
Strong exposure to Azure AKS, Kubernetes, Docker, Service Mesh, and API-driven architectures, with operational support experience for React front-end and Spring Boot microservices in production environments
Hands-on experience with observability tools (Dynatrace, Splunk, Grafana, Prometheus) and strong scripting and configuration skills (Python, Bash, PowerShell, YAML) to build automation that reduces toil and improves incident response
Proven experience in incident management, root cause analysis, and implementing permanent corrective actions that drive long-term reliability improvements
Experience with SRE principles, CI/CD pipelines (Jenkins, GitHub Actions), and cloud platforms (Azure required; AWS/GCP/OCI a plus)
Strong analytical and problem-solving abilities with clear communication skills under pressure, a collaborative mindset, and passion for reducing toil while improving developer and operator experiences

Senior Engineer, SRE