ECI Software Solutions is seeking a Site Reliability Engineering (SRE) Ops Team Lead to ensure the stability and operational excellence of their production systems. The role involves leading incident response, optimizing performance, and collaborating with various teams to maintain service reliability.

Responsibilities:

Take full ownership of the day-to-day operation and support of our always-on production systems
Lead high-stakes incident response with precision: triage, coordinate, communicate, and resolve
Drive impactful post-incident reviews that fuel continuous operational improvements and prevent future issues
Enforce runbooks, SOPs, and escalation paths to keep operations smooth and predictable
Monitor and improve uptime, SLIs, SLOs, error-budget consumption, and MTTR with a relentless focus on results
Oversee and optimize on-call rotations and operational readiness to keep the team sharp
Own the monitoring and alerting landscape to ensure no critical signal goes unnoticed
Master and optimize observability platforms like Coralogix to extract actionable insights and improve alert quality
Refine alerting strategies and incident workflows using Coralogix and FireHydrant to reduce noise and boost response effectiveness
Build and maintain real-time dashboards that provide crystal-clear visibility into service health
Champion automation initiatives that slash manual toil and accelerate incident response
Implement GitOps practices to deliver consistent, auditable, and reliable operational changes
Contribute to Terraform-driven infrastructure with a sharp eye on operability and maintainability
Review changes rigorously for operational impact, resiliency, and supportability
Lead the charge on operational cost awareness and drive initiatives to optimize spend through right-sizing and waste reduction
Partner on capacity planning and demand forecasting to guarantee system stability under all conditions
Make smart trade-offs balancing cost, performance, and reliability in every operational decision
Inspire and mentor SRE team members with hands-on leadership and operational expertise
Be the go-to escalation point for production issues and operational challenges
Collaborate seamlessly across Product, Development, Infrastructure, and Support teams to ensure flawless service delivery
Cultivate a culture of operational discipline, accountability, and relentless improvement
Engage actively in Agile ceremonies and manage operational workflows through Jira

Requirements:

Deep hands-on experience in production operations, SRE, DevOps, or Infrastructure roles
Proven success operating and supporting production systems in hybrid cloud and on-prem environments
Expertise in incident management, on-call best practices, and operational processes
Proficiency with GitOps workflows, Terraform, and observability tools
Strong communication skills with the ability to lead confidently during high-pressure incidents and coordinate cross-team efforts
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years in SRE, DevOps, Infrastructure, or Production Operations roles
Cloud certifications from providers such as AWS, Azure, or Google Cloud
Experience in Agile/Scrum environments and Jira-based work management
Background supporting high-availability, customer-facing SaaS platforms
