ECI Software Solutions is seeking a hands-on Site Reliability Engineer to enhance the reliability and performance of its Manufacturing ERP Portfolio. In this role, you will ensure the operational excellence of production systems, collaborate with cross-functional teams, and drive initiatives in incident response, automation, and cost optimization.
Responsibilities:
- Be the guardian of our 24/7 production environments, swiftly responding to incidents, driving root cause analyses, and continuously improving uptime, error-budget adherence, and recovery metrics. Bring a proactive mindset to identify risks before they impact our users.
- Design and maintain cutting-edge observability and incident-management tooling using platforms like Coralogix and FireHydrant. Build intuitive dashboards and fine-tune alerting so our teams have clear, actionable insights without the noise.
- Champion GitOps principles and Terraform-driven infrastructure as code. Automate repetitive tasks, streamline CI/CD pipelines, and review pull requests to embed reliability and operational excellence into every deployment.
- Drive cloud and infrastructure cost optimization initiatives, balancing performance with budget-conscious decisions. Collaborate on capacity planning and architect solutions that are both reliable and cost-effective.
- Work hand-in-hand with cross-functional teams in an Agile environment, contributing to sprint ceremonies, documenting runbooks, and fostering a culture of continuous learning and improvement.
Requirements:
- 3–5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Infrastructure roles
- Deep expertise in at least one major cloud platform (AWS, Azure, or GCP)
- Fluency with Linux/Unix systems administration, including kernel internals, networking, file systems, and advanced scripting (Bash, Python) for troubleshooting and automation
- Proven experience managing production systems in hybrid cloud and on-premises environments
- Familiarity with GitOps workflows, Terraform, and observability tools
- Active participation in incident response and on-call rotations
- Exceptional troubleshooting, problem-solving, and communication skills
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- Experience with Kubernetes or other container services
- Experience supporting high-availability SaaS platforms
- Cloud certifications (AWS, Azure, or Google Cloud)
- Agile/Scrum experience and proficiency with Jira
- Knowledge of FinOps and cost optimization best practices