Liveline Technologies uses artificial intelligence to improve manufacturing performance, providing real-time process control and predictive assistance. The company is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its production services, with responsibility for infrastructure management, automation, and incident response.
Responsibilities:
- Maintain high availability, performance, and security of Liveline’s production stack across AWS and plant/edge environments
- Stand up, tune, and maintain Prometheus/Grafana dashboards, alerts, recording rules, and runbooks. Implement logging and tracing (e.g., OpenTelemetry) and actionable alerting
- Build and manage reproducible infrastructure with Terraform (VPC, IAM, EC2/EKS/ECS, RDS, S3, CloudWatch, CloudTrail). Apply version control, code reviews, and plan/apply workflows
- Write Bash and Python scripts and small services to automate operational tasks, health checks, failover routines, backup/restore, and environment bootstrapping
- Participate in a follow-the-sun/on-call rotation; triage and resolve incidents, lead initial comms, and produce blameless postmortems with clear corrective actions
- Define and instrument SLIs (availability, latency, error rate, freshness), set SLOs with stakeholders, and manage error budgets to guide release velocity and reliability tradeoffs
- Support secure, reliable connectivity between factory networks and cloud (site-to-site VPNs, routing, DNS, TLS, private subnets, security groups, network ACLs)
- Operate and tune PostgreSQL/TimescaleDB, InfluxDB, or similar time-series/relational stores; manage backups, PITR, replication, partitioning, and performance baselining
- Contribute to build/deploy pipelines (e.g., GitHub Actions/GitLab CI), implement canaries/blue-green strategies, and enforce change management and rollback plans
- Enforce least-privilege IAM, secret management (AWS Secrets Manager/SSM), encryption, artifact signing, and basic hardening for Linux and Kubernetes workloads
- Partner with process/controls engineers to ensure reliable data ingestion from PLCs/industrial gateways (e.g., OPC UA/Modbus), and safe deploys to plant edge nodes
- Right-size compute/storage, set budgets/alerts, forecast capacity, and optimize resource utilization without compromising SLOs
- Author and maintain runbooks, architecture diagrams, operational playbooks, and disaster recovery procedures
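
As an illustration of the SLO and error-budget responsibility listed above, the arithmetic can be sketched as follows. This is a minimal sketch, not Liveline's method: the function names, the 30-day window, and the 99.9% target are hypothetical values chosen for the example.

```python
# Hypothetical helpers for error-budget arithmetic; names and defaults are
# assumptions made for this example, not part of any Liveline tooling.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows per window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of a request-based error budget still unspent.

    The result is negative once more requests have failed than the SLO allows.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% availability target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A remaining budget near zero or below would typically argue for slowing releases in favor of reliability work, which is the tradeoff the SLO bullet describes.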
Requirements:
- Bachelor's Degree in IT, Computer Science, or Computer Engineering (or equivalent experience)
- 5+ years of experience in a corporate IT or startup setting
- Familiarity with containers (Docker) and orchestration (Kubernetes or ECS)
- Experience running production workloads, participating in on-call, and writing postmortems
- Strong communication skills with the ability to explain tradeoffs to non-SRE stakeholders
- Intellectual curiosity, ownership mindset, and bias for automation
- Willingness and ability to travel to customer sites and plants, as necessary
Preferred:
- Kubernetes (EKS), Helm, Kustomize
- Service Mesh/Ingress (Envoy, NGINX, ALB)
- Logging/Tracing: OpenSearch/ELK, Loki, OpenTelemetry
- Config Management: Ansible
- Secrets & PKI: HashiCorp Vault, mTLS
- Edge/Industrial Protocols: OPC UA, Modbus, MQTT; experience with industrial gateways
- Compliance exposure (SOC 2, ISO 27001) and change management (ITIL)