4IR Solutions delivers mission-critical IT/OT infrastructure for industrial customers that can't afford downtime. The Senior Site Reliability Engineer will design complex architectures, manage customer solutions, and ensure the reliability and observability of systems while participating in incident response and automation efforts.
Responsibilities:
- Design complex IT/OT architectures—in cloud and on-prem—that are secure, recoverable, and sized appropriately
- Work directly with customers to understand their environment and estimate effort
- Own customer solutions end-to-end: requirements, design, build, and support
- Build or use reusable modules when it makes sense—build bespoke when it doesn't
- Deploy and manage Kubernetes-based infrastructure and stateful applications across diverse customer environments
- Participate in on-call rotation alongside the rest of the team—everyone here supports what we ship
- Own incidents through resolution, then drive root cause analysis that eliminates the class of problem—not just the symptom
- Build the runbooks, alerts, and automation that make the next incident less likely or less painful
- Work with Infrastructure-as-Code tools to provision and manage diverse customer environments
- Implement and maintain GitOps workflows for in-cluster deployments
- Ensure all infrastructure and application changes are declarative and version-controlled
- Automate self-healing and system updates—reduce manual intervention and keep environments current
- Build and maintain monitoring, alerting, and dashboards using Prometheus, Loki, and Grafana
- Define SLIs and SLOs that reflect what actually matters to customers
- Surface real problems, reduce noise, and continually improve reliability and team efficiency
- Contribute to standards, patterns, and processes that make us better—not bureaucracy for its own sake
- Bring the SRE mindset: automate toil, prefer boring/stable systems, and relentlessly improve
Requirements:
- 5+ years in SRE, DevOps, or Infrastructure Engineering
- Strong Kubernetes skills in production environments—you'll troubleshoot real clusters, not just tutorials
- Experience with GitOps tooling (ArgoCD, Rancher Fleet, FluxCD, or similar)
- Solid understanding of Infrastructure-as-Code concepts (Terraform, Pulumi, Crossplane, or similar)
- Real incident response experience—you've been on-call, stayed calm, and fixed things under pressure
- Comfort with heterogeneous environments—every customer site is a little different and you need to adapt
- Clear communication skills—you can write a useful runbook, gather requirements on a customer call, and document what you learned
- Ability to operate in ambiguity—we're building clarity, not waiting for it
- Azure experience (our primary cloud)
- Experience with SUSE ecosystem (SLE Micro, RKE2, Rancher, Longhorn)
- Industrial, manufacturing, or OT environment experience
- Familiarity with Inductive Automation's Ignition platform and MQTT
- Experience in a startup or small-team environment where you wore many hats