ECI Software Solutions is seeking a Site Reliability Engineering (SRE) Ops Team Lead to ensure the stability and operational excellence of its production systems. The role involves leading incident response, optimizing performance, and collaborating with Product, Development, Infrastructure, and Support teams to maintain service reliability.
Responsibilities:
- Take full ownership of the day-to-day operation and support of our always-on production systems
- Lead high-stakes incident response with precision: triage, coordinate, communicate, and resolve
- Drive impactful post-incident reviews that fuel continuous operational improvements and prevent future issues
- Enforce runbooks, SOPs, and escalation paths to keep operations smooth and predictable
- Track and improve uptime, SLIs, SLOs, error budgets, and MTTR with a relentless focus on results
- Oversee and optimize on-call rotations and operational readiness to keep the team sharp
- Own the monitoring and alerting landscape to ensure no critical signal goes unnoticed
- Master and optimize observability platforms like Coralogix to extract actionable insights and improve alert quality
- Refine alerting strategies and incident workflows using Coralogix and FireHydrant to reduce noise and boost response effectiveness
- Build and maintain real-time dashboards that provide crystal-clear visibility into service health
- Champion automation initiatives that slash manual toil and accelerate incident response
- Implement GitOps practices to deliver consistent, auditable, and reliable operational changes
- Contribute to Terraform-driven infrastructure with a sharp eye on operability and maintainability
- Review changes rigorously for operational impact, resiliency, and supportability
- Lead the charge on operational cost awareness and drive initiatives to optimize spend through right-sizing and waste reduction
- Partner on capacity planning and demand forecasting to guarantee system stability under all conditions
- Make smart trade-offs balancing cost, performance, and reliability in every operational decision
- Inspire and mentor SRE team members with hands-on leadership and operational expertise
- Be the go-to escalation point for production issues and operational challenges
- Collaborate seamlessly across Product, Development, Infrastructure, and Support teams to ensure flawless service delivery
- Cultivate a culture of operational discipline, accountability, and relentless improvement
- Engage actively in Agile ceremonies and manage operational workflows through Jira
Requirements:
- Deep hands-on experience in production operations, SRE, DevOps, or Infrastructure roles
- Proven success operating and supporting production systems in hybrid cloud and on-prem environments
- Expertise in incident management, on-call best practices, and operational processes
- Proficiency with GitOps workflows, Terraform, and observability tools
- Strong communication skills with the ability to lead confidently during high-pressure incidents and coordinate cross-team efforts
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
- 5+ years in SRE, DevOps, Infrastructure, or Production Operations roles
- Cloud certifications from providers such as AWS, Azure, or Google Cloud
- Experience in Agile/Scrum environments and Jira-based work management
- Background supporting high-availability, customer-facing SaaS platforms