YPO is seeking a Senior Lead DevOps Engineer to design, build, and operate the cloud infrastructure and developer platform for its next generation of products. This hands-on technical leadership role spans cloud infrastructure, CI/CD pipelines, release engineering, and platform reliability, all in support of a rapidly scaling AI-first mobile platform.
Responsibilities:
- Own the architecture and day-to-day operation of YPO's cloud infrastructure across its full lifecycle
- Architect, implement, and continuously evolve YPO's cloud infrastructure across AWS, Azure, and/or GCP — ensuring it is scalable, resilient, cost-efficient, and production-ready for a global AI-first platform
- Design and manage multi-region, highly available environments that meet YPO's performance and uptime requirements for a 35,000+ member global community
- Own cloud cost management and FinOps practices — implementing tagging strategies, reserved capacity planning, and anomaly detection to optimise infrastructure spend without sacrificing reliability
- Lead the evaluation and adoption of new cloud services, platforms, and tooling — making well-reasoned build-vs-buy decisions based on engineering impact and long-term maintainability
- Manage DNS, CDN, load balancing, and networking configurations across cloud environments, ensuring global performance and failover capabilities
- Lead YPO's Infrastructure as Code practice using Terraform as the primary tool, ensuring all infrastructure is version-controlled, reviewed, tested, and deployed through automation — never manually
- Define and enforce IaC standards, module structures, and governance practices across the engineering organisation, ensuring infrastructure code is readable, reusable, and maintainable over time
- Automate environment provisioning, teardown, and configuration management for development, staging, and production environments — enabling engineers to spin up and destroy environments on demand
- Build and maintain automation pipelines for routine operational tasks including certificate rotation, secret rotation, compliance remediation, and infrastructure drift detection
- Write clean, well-tested automation scripts in Python, Bash, or equivalent — treating operational scripts with the same engineering rigour applied to product code
- Design, build, and maintain end-to-end CI/CD pipelines for YPO's mobile (iOS and Android), backend API, AI platform, and data engineering workloads — reducing time-to-deploy and increasing deployment frequency
- Implement branch strategies, environment promotion workflows, and feature flagging patterns that allow teams to ship incrementally and safely to a global production audience
- Integrate automated quality gates — unit tests, integration tests, security scans (SAST/DAST/SCA), container scanning, and IaC linting — as non-negotiable steps in every pipeline
- Lead the adoption of progressive delivery techniques including blue-green deployments, canary releases, and traffic shifting to minimise deployment risk and enable rapid rollback
- Partner with the Lead Security Engineer to embed security and compliance checks into every pipeline stage, ensuring secure-by-default releases across all environments
- Own release documentation, change management workflows, and deployment runbooks — ensuring all production changes are auditable, traceable, and recoverable
- Design, operate, and continuously improve YPO's container orchestration infrastructure using Kubernetes (EKS, AKS, or GKE), ensuring reliable scheduling, resource efficiency, and operational simplicity
- Manage container image governance, including base image standards, image scanning pipelines, registry management, and deprecation policies for outdated or vulnerable images
- Implement and maintain service mesh, ingress controllers, network policies, and inter-service security patterns appropriate for YPO's AI platform and mobile API surfaces
- Evaluate and adopt platform engineering tools that improve developer self-service — internal developer platforms (IDPs), environment-as-a-service patterns, and golden path templates that let engineers provision what they need without DevOps as a bottleneck
- Lead the migration, decomposition, or consolidation of existing services as part of YPO's digital transformation roadmap — balancing technical debt reduction with delivery velocity
- Design and implement a comprehensive observability stack covering metrics, logs, distributed traces, and synthetic monitoring — giving engineering and product teams clear, real-time visibility into system health and member experience quality
- Define and enforce SLOs, SLIs, and error budgets across YPO's platform services, establishing a shared language between product, engineering, and operations for reliability conversations
- Build and maintain dashboards, alerting rules, and on-call runbooks that surface actionable signals, reducing alert fatigue and shortening both mean time to detect (MTTD) and mean time to recover (MTTR)
- Lead blameless post-mortem processes following significant incidents, driving systemic improvements and institutional learning rather than point fixes
- Own capacity planning and performance benchmarking for the AI-first mobile platform, ensuring infrastructure scales proactively ahead of member growth and feature launches
- Partner with the Lead Security Engineer to embed security controls, policy-as-code enforcement, and compliance automation throughout the CI/CD pipeline and infrastructure provisioning lifecycle
- Implement and maintain secrets management solutions (HashiCorp Vault, AWS Secrets Manager, or equivalent) — ensuring no credentials, tokens, or sensitive configuration are ever stored in source code or plaintext
- Enforce cloud security baselines using policy-as-code frameworks (Open Policy Agent, AWS Config Rules, Azure Policy) to detect and auto-remediate configuration drift in real time
- Support SOC 2, ISO 27001, and other compliance programmes by providing infrastructure evidence, automating audit artefact collection, and maintaining clear audit trails for all infrastructure changes
- Manage network security controls including VPCs, security groups, private endpoints, and zero-trust connectivity patterns across cloud environments
- Own the internal developer experience — streamlining local development environments, onboarding workflows, and self-service tooling so that engineers spend their time building product, not fighting infrastructure
- Define and document engineering standards for environment configuration, deployment patterns, and operational runbooks, ensuring institutional knowledge is captured and accessible
- Mentor and up-level junior engineers and platform contributors, building DevOps literacy across the wider engineering organisation and breaking down silos between platform and product teams
- Act as a cross-functional bridge between product, mobile engineering, AI/data engineering, and security — translating competing infrastructure priorities into a coherent, sequenced delivery plan
- Contribute to technology investment decisions with well-reasoned proposals, total-cost-of-ownership analysis, and clear trade-off documentation
Requirements:
- 5+ years of hands-on experience in DevOps, platform engineering, or site reliability engineering, with at least 2 years in a senior or lead capacity
- Deep, demonstrable expertise with at least one major cloud provider (AWS strongly preferred) and solid working knowledge of a second (Azure or GCP)
- Infrastructure as Code proficiency: Terraform is required. Experience with CloudFormation, Pulumi, or CDK is a plus
- CI/CD experience: hands-on design and operation of pipelines using GitHub Actions, GitLab CI, CircleCI, Jenkins, or equivalent tools across multiple workload types
- Strong Kubernetes experience, including cluster management, Helm chart authoring, RBAC, network policies, and workload auto-scaling in a production cloud environment
- Proficiency in Python for automation and tooling; comfort with Bash/shell scripting for operational tasks
- Solid understanding of networking fundamentals — DNS, TCP/IP, TLS, load balancing, CDN, VPC design, and private connectivity patterns
- Experience implementing observability solutions using tools such as Datadog, Grafana, Prometheus, OpenTelemetry, CloudWatch, or equivalent platforms
- Practical knowledge of container security, secrets management, and cloud IAM patterns, with experience working alongside a security engineering function
- Strong communication skills — able to write clear documentation, present technical trade-offs to non-technical stakeholders, and lead engineering conversations with confidence
- Demonstrated ability to operate with autonomy in a fast-moving environment, balancing long-term platform investment with near-term delivery needs
- Experience supporting native iOS and/or Android mobile release pipelines, including code signing, provisioning profile management, App Store / Play Store automation, and mobile-specific testing infrastructure
- Familiarity with AI/ML infrastructure, including model serving platforms, GPU workload scheduling, data pipeline orchestration (Airflow, Prefect, or equivalent), and vector database operations
- Experience with platform engineering tools such as Backstage, Port, or similar internal developer portals
- Exposure to FinOps tooling (Infracost, CloudHealth, Spot.io) and cloud cost optimisation at scale
- Experience with multi-region, active-active deployment architectures and global traffic management
- Prior experience in a global SaaS or membership platform serving diverse geographic markets