Sight Machine is a company focused on enhancing manufacturing processes through innovative data integration and visualization. As an Infrastructure Engineer, you will own and evolve the cloud infrastructure, design CI/CD pipelines, and implement AI-assisted automation to improve operational efficiency.

Responsibilities:

Owning and evolving our Kubernetes-based cloud infrastructure across Azure and other providers, including fleet management, networking, and cluster operations at scale
Designing and implementing CI/CD pipelines that let the engineering team ship faster and with more confidence, including automated testing, progressive delivery, and rollback capability
Building AI-assisted automation for operational tasks: runbook generation, anomaly triage, alerting logic, and anywhere else we can eliminate repetitive human intervention without sacrificing control
Driving Infrastructure as Code discipline across the platform (Terraform, Helm, FluxCD) so that every environment is reproducible, auditable, and fast to recover
Building and maintaining monitoring and observability infrastructure that gives the team real signal across our stack, from container health to database performance to customer-facing SLAs
Participating in the on-call rotation and using every incident as a forcing function to improve the system: better runbooks, better alerting, better automation
Collaborating closely with Development Engineering to close the gap between what gets built and what gets operated well in production

Requirements:

5+ years of professional infrastructure or DevOps engineering experience, with at least some of that at meaningful scale in a cloud-native environment
Deep hands-on experience with Kubernetes and Docker on at least one major cloud provider (Azure, GCP, AWS). You have run clusters in production and have the scars to prove it
Strong IaC fluency with Terraform, Helm, FluxCD, or similar. You write infrastructure the way developers write code: versioned, reviewed, and tested
Real fluency with AI development tools. Not just autocomplete. You have used AI to write automation scripts, draft runbooks, accelerate incident triage, or build internal tooling. Show us how it has actually changed your output
Solid coding ability in at least one scripting or systems language (Python, Go, or similar). You write tools, not just configs
Strong Linux fundamentals and a working knowledge of networking: TCP/IP, DNS, load balancing, and how things break when they should not
Experience with monitoring and alerting stacks: Prometheus, Sentry, Opsgenie, or equivalent. You build observability that gives people real signal, not noise
A track record of on-call participation and a philosophy around incident response that leads to improvement, not just resolution
Clear, direct communication. You can write a postmortem, a runbook, or a design doc that people actually read
A bias for action. You have made decisions under uncertainty, taken the risk, and adjusted when you were wrong. Endless planning is not your style
Familiarity with our current stack: Kubernetes, FluxCD, Terraform, Helm, Prometheus, Elasticsearch, Kafka, PostgreSQL, Jenkins
Experience with Python and Java in the context of platform tooling or automation
Prior work in industrial IoT, manufacturing, or operational technology environments
Experience managing infrastructure for multi-tenant SaaS platforms
An active GitHub or open-source presence that shows how you approach technical problems when no one is watching
