Role Overview

Designing, building, and maintaining AWS infrastructure using Terraform (EC2, RDS, S3, SQS, Lambda, ALB, ElastiCache, Route 53, VPC networking)
Writing and maintaining Puppet modules to configure and manage fleets of EC2 instances across multiple auto-scaling groups
Maintaining and extending Python-based automation and tooling that supports platform operations
Operating and improving distributed service discovery and configuration management (etcd)
Managing and tuning a multi-tier caching strategy (Varnish, Redis/Valkey, PHP OPcache)
Running and scaling our observability stack (Prometheus, Grafana, Loki, Fluentd, PagerDuty) and participating in on-call rotations
Evaluating and implementing distributed storage solutions as the platform evolves
Improving deployment workflows and release processes
Collaborating with internal teams on API contracts, integration patterns, and operational tooling
Participating in incident response, root cause analysis, and platform reliability improvements

Requirements

Strong experience with AWS services in production — particularly EC2, RDS, S3, SQS, Lambda, ALB, ElastiCache, Route 53, IAM, and VPC networking
Proficiency in authoring and maintaining Terraform modules for production infrastructure
Proficiency in authoring and maintaining Puppet modules (or equivalent agent-based configuration management) for fleet management
Solid Python skills — you'll be writing and maintaining production daemons, not just scripts
Deep Linux systems knowledge (Ubuntu) — comfortable with Apache/Nginx, PHP-FPM, Varnish, systemd, filesystem mounts, and networking fundamentals
Understanding of distributed systems concepts: consensus, leader election, distributed locking, eventual consistency, and the tradeoffs involved
Proficiency in building and maintaining observability pipelines (Prometheus, Grafana, Loki, or equivalent) in production
Comfortable working in a GitLab-based CI/CD workflow
Clear communicator who can document architectural decisions and explain technical tradeoffs to both technical and non-technical stakeholders
Hands-on experience with distributed storage systems such as Ceph, GlusterFS, JuiceFS, CubeFS, or AWS EFS — particularly in the context of migration or evaluation
Familiarity with etcd (or similar distributed key-value stores like Consul or ZooKeeper) including watch APIs, TTL-based locking, and cluster operations
Experience with Varnish and VCL, especially dynamic backend routing or multi-tenant configurations
Working knowledge of PHP — not to build applications, but to understand and maintain integration scripts that bridge infrastructure and application layers
Background in multi-tenant SaaS platform design — particularly database-per-tenant models on shared infrastructure
Familiarity with Moodle LMS or education technology platforms
Experience with secrets management solutions (AWS Secrets Manager, HashiCorp Vault, Parameter Store) and automated credential rotation
Experience designing zero-downtime deployment strategies for VM-based (non-containerized) environments

Tech Stack

Apache
AWS
Consul
Distributed Systems
EC2
Grafana
Linux
NGINX
PHP
Prometheus
Puppet
Python
Redis
Terraform
Vault
ZooKeeper

Benefits

Open LMS is an equal employment opportunity/affirmative action employer and considers qualified applicants for employment without regard to race, gender, age, color, religion, national origin, marital status, disability, sexual orientation, or any other protected factor.

Cloud Infrastructure Engineer
