Harnham is an AI-driven advertising technology company providing customizable algorithmic media buying infrastructure. The company is seeking a Principal Site Reliability Engineer to define and own its AWS event-driven platform design, build systems, and establish standards that grow its operational capability over time.

Responsibilities:

Define and own AWS event-driven platform design
Build systems using EventBridge, Kafka/MSK, Kinesis, Lambda, Fargate, SQS/SNS, Step Functions
Establish standards for idempotency, event schema governance, and traceability
Help grow the SRE function and operational capability over time
Operate/optimize multi-cluster production K8s
Manage API server scaling, etcd performance, RBAC, admission controllers
Implement GitOps, progressive delivery, cluster level security, multi-tenant isolation
Define Terraform module standards
Build reusable infrastructure primitives
Enforce GitHub guardrails and CI/CD safety
Implement automated infra testing, policy validation, rollback safety
Create containerized model-serving architecture (OCI images)
Define security, isolation, telemetry, upgrade strategy, and runtime contracts
Define SLIs, SLOs, error budgets
Implement distributed tracing and golden signals
Build incident response, on-call structure, escalation standards
Ensure clear boundaries between infra, ML systems, and application teams
Manage SSO, provisioning, and Terraform deployment
Govern cluster policies and Unity Catalog

Requirements:

7–10+ years in SRE / Infrastructure / Platform Engineering
Strong software/data engineering crossover
AWS expert
Databricks, Apache Spark, Snowflake experience
Deep experience handling large-scale data (5–8+ TB batches every 4 hours)
Experience building greenfield systems in fast-moving startups
Ad tech or marketing tech background
History of promotions / growth
Has built systems from 0 to 1
Broad technical exposure
