4IR Solutions delivers mission-critical IT/OT infrastructure for industrial customers that can't afford downtime. They are seeking a Senior Site Reliability Engineer to design complex architectures, manage Kubernetes-based infrastructure, and improve reliability through automation and observability.
Responsibilities:
- Design complex IT/OT architectures—in cloud and on-prem—that are secure, recoverable, and sized appropriately
- Work directly with customers to understand their environment and estimate effort
- Own customer solutions end-to-end: requirements, design, build, and support
- Use or build reusable modules where they make sense, and build bespoke solutions where they don't
- Deploy and manage Kubernetes-based infrastructure and stateful applications across diverse customer environments
- Participate in the on-call rotation alongside the rest of the team; everyone here supports what we ship
- Own incidents through resolution, then drive root cause analysis that eliminates the class of problem—not just the symptom
- Build the runbooks, alerts, and automation that make the next incident less likely or less painful
- Work with Infrastructure-as-Code tools to provision and manage diverse customer environments
- Implement and maintain GitOps workflows for in-cluster deployments
- Ensure all infrastructure and application changes are declarative and version-controlled
- Automate self-healing and system updates—reduce manual intervention and keep environments current
- Build and maintain monitoring, alerting, and dashboards using Prometheus, Loki, and Grafana
- Define SLIs and SLOs that reflect what actually matters to customers
- Surface real problems, reduce noise, and continually improve reliability and team efficiency
- Contribute to standards, patterns, and processes that make us better—not bureaucracy for its own sake
- Bring the SRE mindset: automate toil, prefer boring/stable systems, and relentlessly improve
Requirements:
- 5+ years in SRE, DevOps, or Infrastructure Engineering
- Strong Kubernetes skills in production environments—you'll troubleshoot real clusters, not just tutorials
- Experience with GitOps tooling (ArgoCD, Rancher Fleet, FluxCD, or similar)
- Solid understanding of Infrastructure-as-Code concepts and tooling (Terraform, Pulumi, Crossplane, or similar)
- Real incident response experience—you've been on-call, stayed calm, and fixed things under pressure
- Comfort with heterogeneous environments—every customer site is a little different and you need to adapt
- Clear communication skills—you can write a useful runbook, gather requirements on a customer call, and document what you learned
- Ability to operate in ambiguity—we're building clarity, not waiting for it
- Azure experience (our primary cloud)
- Experience with SUSE ecosystem (SLE Micro, RKE2, Rancher, Longhorn)
- Industrial, manufacturing, or OT environment experience
- Familiarity with Inductive Automation's Ignition platform and MQTT
- Experience in a startup or small-team environment where you wore many hats