Harnham is an AI-driven advertising technology company providing customizable algorithmic media buying infrastructure. They are seeking a Principal Site Reliability Engineer to define and own the AWS event-driven platform design, build the systems that run on it, and establish the standards that grow operational capability over time.
Responsibilities:
- Define and own AWS event-driven platform design
- Build systems using EventBridge, Kafka/MSK, Kinesis, Lambda, Fargate, SQS/SNS, Step Functions
- Establish standards for idempotency, event schema governance, and traceability
- Help grow the SRE function and operational capability over time
- Operate/optimize multi-cluster production K8s
- Manage API server scaling, etcd performance, RBAC, admission controllers
- Implement GitOps, progressive delivery, cluster-level security, multi-tenant isolation
- Define Terraform module standards
- Build reusable infrastructure primitives
- Enforce GitHub guardrails and CI/CD safety
- Implement automated infra testing, policy validation, rollback safety
- Create containerized model-serving architecture (OCI images)
- Define security, isolation, telemetry, upgrade strategy, and runtime contracts
- Define SLIs, SLOs, error budgets
- Implement distributed tracing and golden signals
- Build incident response, on-call structure, escalation standards
- Ensure clear boundaries between infra, ML systems, and application teams
- Manage SSO, provisioning, and Terraform deployment
- Manage cluster policies and Unity Catalog governance
Requirements:
- 7–10+ years in SRE / Infrastructure / Platform Engineering
- Strong software/data engineering crossover
- AWS expert
- Databricks, Apache Spark, Snowflake experience
- Deep experience handling large-scale data (5–8+ TB batches every 4 hours)
- Experience building greenfield systems in fast-moving startups
- Ad tech or marketing tech background
- History of promotions / growth
- Has built systems from 0 to 1
- Broad technical exposure