ECI Software Solutions is seeking a hands-on Site Reliability Engineer to ensure the reliability, performance, and scalability of their Manufacturing ERP Portfolio. The role involves collaborating with various teams to enhance uptime and incident response while driving automation and operational excellence.
Responsibilities:
- Be the guardian of our 24/7 production environments, swiftly responding to incidents, driving root cause analyses, and continuously improving uptime, error budget adherence, and recovery metrics. Your proactive mindset will identify risks before they impact our users
- Design and maintain cutting-edge observability frameworks using tools like Coralogix and FireHydrant. Build intuitive dashboards and fine-tune alerting to ensure our teams have clear, actionable insights without the noise
- Champion GitOps principles and Terraform-driven infrastructure as code. Automate repetitive tasks, streamline CI/CD pipelines, and review pull requests to embed reliability and operational excellence into every deployment
- Drive cloud and infrastructure cost optimization initiatives, balancing performance with budget-conscious decisions. Collaborate on capacity planning and architect solutions that are both reliable and cost-effective
- Work hand-in-hand with cross-functional teams in an Agile environment, contributing to sprint ceremonies, documenting runbooks, and fostering a culture of continuous learning and improvement
Requirements:
- 3–5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Infrastructure roles
- Deep expertise in at least one major cloud platform (AWS, Azure, or GCP)
- Fluency with Linux/Unix systems administration, including kernel internals, networking, file systems, and advanced scripting (Bash, Python) for troubleshooting and automation
- Proven experience managing production systems in hybrid cloud and on-premises environments
- Familiarity with GitOps workflows, Terraform, and observability tools
- Active participation in incident response and on-call rotations
- Exceptional troubleshooting, problem-solving, and communication skills
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- Experience with Kubernetes or other container services
- Experience supporting high-availability SaaS platforms
- Cloud certifications (AWS, Azure, or Google Cloud)
- Agile/Scrum experience and proficiency with Jira
- Knowledge of FinOps and cost optimization best practices