Sephora is a leading beauty retailer that values diversity and inclusivity. As a Senior Engineer in Site Reliability Engineering, you will ensure stable online experiences for millions of customers by monitoring and optimizing the reliability of Sephora's digital platforms.
Responsibilities:
- Ensure Platform Stability. Operate and support the Dotcom and OMNI platforms (including BOPIS and Same-Day Delivery), ensuring high availability, resilience, and stable customer experiences during both normal operations and peak traffic events
- Lead Incident Response. Triage, diagnose, and resolve L2/L3 production incidents; lead post-incident reviews and partner with engineering teams on permanent corrective actions to eliminate root causes
- Drive Intelligent Automation. Build automation solutions that cut operational toil, and create AI-driven reliability tools and agentic workflows that reduce mean time to resolution and improve productivity and overall stability
- Enhance Observability. Develop and optimize observability through logs, metrics, traces, dashboards, and anomaly detection; refine alerting and telemetry pipelines to proactively identify and resolve issues
- Validate Release Readiness. Ensure readiness for releases, seasonal events, feature launches, and traffic spikes through resiliency checks, performance validation, and comprehensive change reviews
- Maintain Reliability Standards. Maintain and optimize SLO/SLI frameworks; monitor error budgets and partner with application teams on continuous reliability improvements
Requirements:
- 6+ years of hands-on SRE, DevOps, or Production Engineering experience in high-scale digital applications, with a strong understanding of reliability principles and operational excellence
- Strong exposure to Azure AKS, Kubernetes, Docker, Service Mesh, and API-driven architectures, with operational support experience for React front-end and Spring Boot microservices in production environments
- Hands-on experience with observability tools (Dynatrace, Splunk, Grafana, Prometheus) and strong scripting and configuration skills (Python, Bash, PowerShell, YAML) to build automation that reduces toil and improves incident response
- Proven experience in incident management, root cause analysis, and implementing permanent corrective actions that drive long-term reliability improvements
- Experience with SRE principles, CI/CD pipelines (Jenkins, GitHub Actions), and cloud platforms (Azure required; AWS/GCP/OCI a plus)
- Strong analytical and problem-solving abilities with clear communication skills under pressure, a collaborative mindset, and passion for reducing toil while improving developer and operator experiences