SoTalent is seeking an experienced Lead Site Reliability Engineer (SRE) to drive large-scale initiatives that ensure platform resilience, performance, and security. In this leadership role, you will oversee SRE projects, enhance observability practices, and mentor engineering teams to strengthen system excellence.

Responsibilities:

Oversee full‑lifecycle SRE projects focused on platform reliability, monitoring, and performance optimization
Architect and enhance observability frameworks—including metrics, dashboards, alerts, and KPIs—to maximize visibility
Lead decisions around system validation, testing strategies, service monitoring, and adoption of new reliability tools
Analyze and troubleshoot service disruptions, determine root causes, and drive long‑term reliability improvements
Lead post‑incident reviews that produce clear documentation and actionable insights
Improve CI/CD pipelines, deployment workflows, and SDLC processes to increase system stability
Collaborate with engineering and development teams to identify issues, develop scalable solutions, and drive consistent improvements
Mentor and coach technical teams on SRE best practices, new technologies, and operational excellence
Guide and review design documents, code, and test cases to maintain high engineering standards
Stay current with emerging tools, cloud technologies, and industry trends, and champion their adoption where beneficial
Foster a culture of continuous improvement, reliability, and cross‑team collaboration

Requirements:

Bachelor's degree in computer science, engineering, or a related technical field
5–7+ years of hands‑on experience in SRE, DevOps, infrastructure, or platform engineering roles
Strong experience with Linux/Unix/Windows administration
Strong experience with observability tools (Splunk, Dynatrace, Elastic, New Relic, Prometheus, Grafana)
Strong experience with CI/CD platforms (Ansible, Jenkins, CloudBees, OpenShift)
Strong experience with public cloud environments (AWS, Azure)
Strong experience with databases (MongoDB, MySQL, Oracle, SQL/PL‑SQL)
Strong experience with version control & deployment tools (GitLab, Bitbucket, Subversion)
Strong experience with ITSM tools like ServiceNow, Atlassian, or BMC
Experience supporting large‑scale, distributed, highly available systems is a strong plus
