Liveline Technologies uses artificial intelligence to improve manufacturing performance, providing real-time process control and predictive assistance. The company is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its production services through infrastructure management, automation, and incident response.

Responsibilities:

Maintain high availability, performance, and security of Liveline’s production stack across AWS and plant/edge environments
Stand up, tune, and maintain Prometheus/Grafana dashboards, alerts, recording rules, and runbooks. Implement logs/traces (e.g., OpenTelemetry) and actionable alerting
Build and manage reproducible infrastructure with Terraform (VPC, IAM, EC2/EKS/ECS, RDS, S3, CloudWatch, CloudTrail). Apply version control, code reviews, and plan/apply workflows
Write Bash and Python scripts and small services to automate operational tasks, health checks, failover routines, backup/restore, and environment bootstrapping
Participate in a follow-the-sun/on-call rotation; triage and resolve incidents, lead initial comms, and produce blameless postmortems with clear corrective actions
Define and instrument SLIs (availability, latency, error rate, freshness), set SLOs with stakeholders, and manage error budgets to guide release velocity and reliability tradeoffs
Support secure, reliable connectivity between factory networks and cloud (site-to-site VPNs, routing, DNS, TLS, private subnets, security groups, network ACLs)
Operate and tune PostgreSQL/TimescaleDB, InfluxDB, or similar time-series/relational stores; manage backups, PITR, replication, partitioning, and performance baselining
Contribute to build/deploy pipelines (e.g., GitHub Actions/GitLab CI), implement canaries/blue-green strategies, and enforce change management and rollback plans
Enforce least-privilege IAM, secret management (AWS Secrets Manager/SSM), encryption, artifact signing, and basic hardening for Linux and Kubernetes workloads
Partner with process/controls engineers to ensure reliable data ingestion from PLCs/industrial gateways (e.g., OPC UA/Modbus), and safe deploys to plant edge nodes
Right-size compute/storage, set budgets/alerts, forecast capacity, and optimize resource utilization without compromising SLOs
Author and maintain runbooks, architecture diagrams, operational playbooks, and disaster recovery procedures

Requirements:

Bachelor's Degree in IT, Computer Science, or Computer Engineering (or equivalent experience)
5+ years of experience in a corporate IT or startup setting
Familiarity with containers (Docker) and orchestration (Kubernetes or ECS)
Experience running production workloads, participating in on-call, and writing postmortems
Strong communication skills with the ability to explain tradeoffs to non-SRE stakeholders
Intellectual curiosity, ownership mindset, and bias for automation
Willingness and ability to travel to customer sites and plants, as necessary
Kubernetes (EKS), Helm, Kustomize
Service Mesh/Ingress (Envoy, NGINX, ALB)
Logging/Tracing: OpenSearch/ELK, Loki, OpenTelemetry
Config Management: Ansible
Secrets & PKI: HashiCorp Vault, mTLS
Edge/Industrial Protocols: OPC UA, Modbus, MQTT; experience with industrial gateways
Compliance exposure (SOC 2, ISO 27001) and change management (ITIL)
