ECI Software Solutions is seeking a hands-on Site Reliability Engineer to enhance the reliability and performance of their Manufacturing ERP Portfolio. In this role, you will ensure the operational excellence of production systems, collaborate with cross-functional teams, and drive initiatives in incident response, automation, and cost optimization.

Responsibilities:

Be the guardian of our 24/7 production environments, swiftly responding to incidents, driving root cause analyses, and continuously enhancing uptime, error budgets, and recovery metrics. Your proactive mindset will identify risks before they impact our users.
Design and maintain cutting-edge observability frameworks using tools like Coralogix and FireHydrant. Build intuitive dashboards and fine-tune alerting to ensure our teams have clear, actionable insights without the noise.
Champion GitOps principles and Terraform-driven infrastructure as code. Automate repetitive tasks, streamline CI/CD pipelines, and review pull requests to embed reliability and operational excellence into every deployment.
Drive cloud and infrastructure cost optimization initiatives, balancing performance with budget-conscious decisions. Collaborate on capacity planning and architect solutions that are both reliable and cost-effective.
Work hand-in-hand with cross-functional teams in an Agile environment, contributing to sprint ceremonies, documenting runbooks, and fostering a culture of continuous learning and improvement.

Requirements:

3–5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Infrastructure roles
Deep expertise in at least one major cloud platform (AWS, Azure, or GCP)
Fluency with Linux/Unix systems administration, including kernel internals, networking, file systems, and advanced scripting (Bash, Python) for troubleshooting and automation
Proven experience managing production systems in hybrid cloud and on-premises environments
Familiarity with GitOps workflows, Terraform, and observability tools
Active participation in incident response and on-call rotations
Exceptional troubleshooting, problem-solving, and communication skills
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
Experience with Kubernetes or other container services
Experience supporting high-availability SaaS platforms
Cloud certifications (AWS, Azure, or Google Cloud)
Agile/Scrum experience and proficiency with Jira
Knowledge of FinOps and cost optimization best practices
