4IR Solutions delivers mission-critical IT/OT infrastructure for industrial customers that can't afford downtime. The Senior Site Reliability Engineer will design complex architectures, manage customer solutions, and ensure the reliability and observability of systems while participating in incident response and automation efforts.

Responsibilities:

Design complex IT/OT architectures—in cloud and on-prem—that are secure, recoverable, and sized appropriately
Work directly with customers to understand their environment and estimate effort
Own customer solutions end-to-end: requirements, design, build, and support
Build or use reusable modules when it makes sense—build bespoke when it doesn't
Deploy and manage Kubernetes-based infrastructure and stateful applications across diverse customer environments
Participate in on-call rotation alongside the rest of the team—everyone here supports what we ship
Own incidents through resolution, then drive root cause analysis that eliminates the class of problem—not just the symptom
Build the runbooks, alerts, and automation that make the next incident less likely or less painful
Work with Infrastructure-as-Code tools to provision and manage diverse customer environments
Implement and maintain GitOps workflows for in-cluster deployments
Ensure all infrastructure and application changes are declarative and version-controlled
Automate self-healing and system updates—reduce manual intervention and keep environments current
Build and maintain monitoring, alerting, and dashboards using Prometheus, Loki, and Grafana
Define SLIs and SLOs that reflect what actually matters to customers
Surface real problems, reduce noise, and continually improve reliability and team efficiency
Contribute to standards, patterns, and processes that make us better—not bureaucracy for its own sake
Bring the SRE mindset: automate toil, prefer boring/stable systems, and relentlessly improve

Requirements:

5+ years in SRE, DevOps, or Infrastructure Engineering
Strong Kubernetes skills in production environments—you'll troubleshoot real clusters, not just tutorials
Experience with GitOps tooling (ArgoCD, Rancher Fleet, FluxCD, or similar)
Solid understanding of Infrastructure-as-Code concepts and tooling (Terraform, Pulumi, Crossplane, or similar)
Real incident response experience—you've been on-call, stayed calm, and fixed things under pressure
Comfort with heterogeneous environments—every customer site is a little different and you need to adapt
Clear communication skills—you can write a useful runbook, gather requirements on a customer call, and document what you learned
Ability to operate in ambiguity—we're building clarity, not waiting for it
Azure experience (our primary cloud)
Experience with SUSE ecosystem (SLE Micro, RKE2, Rancher, Longhorn)
Industrial, manufacturing, or OT environment experience
Familiarity with Inductive Automation's Ignition platform and MQTT
Experience in a startup or small-team environment where you wore many hats
