SoTalent is seeking an experienced Lead Site Reliability Engineer (SRE) to drive large-scale initiatives that ensure platform resilience, performance, and security. In this leadership role, you will oversee SRE projects, enhance observability practices, and mentor engineering teams to raise the standard of system reliability.
Responsibilities:
- Oversee full‑lifecycle SRE projects focused on platform reliability, monitoring, and performance optimization
- Architect and enhance observability frameworks (metrics, dashboards, alerts, and KPIs) to maximize visibility
- Lead decisions around system validation, testing strategies, service monitoring, and adoption of new reliability tools
- Analyze and troubleshoot service disruptions, determine root causes, and drive long‑term reliability improvements
- Conduct and lead post‑incident reviews with clear documentation and actionable insights
- Improve CI/CD pipelines, deployment workflows, and SDLC processes to increase system stability
- Collaborate with engineering and development teams to identify issues, develop scalable solutions, and drive consistent improvements
- Mentor and coach technical teams on SRE best practices, new technologies, and operational excellence
- Guide and review design documents, code, and test cases to maintain high engineering standards
- Stay on top of emerging tools, cloud technologies, and industry trends, and champion their adoption where beneficial
- Foster a culture of continuous improvement, reliability, and cross‑team collaboration
Requirements:
- Bachelor's degree in computer science, engineering, or a related technical field
- 5–7+ years of hands‑on experience in SRE, DevOps, infrastructure, or platform engineering roles
- Strong experience with Linux/Unix/Windows administration
- Strong experience with observability tools (Splunk, Dynatrace, Elastic, New Relic, Prometheus, Grafana)
- Strong experience with CI/CD platforms (Ansible, Jenkins, CloudBees, OpenShift)
- Strong experience with public cloud environments (AWS, Azure)
- Strong experience with databases (MongoDB, MySQL, Oracle, SQL/PL‑SQL)
- Strong experience with version control & deployment tools (GitLab, Bitbucket, Subversion)
- Strong experience with ITSM tools like ServiceNow, Atlassian, or BMC
- Experience supporting large‑scale, distributed, highly available systems is a strong plus