Equinix is the world’s digital infrastructure company®, enabling innovations that enrich our work, life, and planet. The Principal Platform Engineer will lead the technical vision for observability and reliability standards across Equinix's global hybrid infrastructure, ensuring secure and scalable delivery of digital products.
Responsibilities:
- Interacts with internal product management and engineering teams to understand product requirements and define the platform roadmap
- Works with the Equinix Engineering Excellence (E3) team in the Equinix IT organization to find common points of acceleration and bidirectional consumption of services
- Acts as a lead representative for Infrastructure P&S requirements in forums for enterprise-wide developer initiatives, plans, and architectures
- Defines the platform reliability standards through the development of a comprehensive SLO/SLI framework
- Drives architectural consistency for observability across a hybrid footprint including 31 metros and multiple AWS regions
- Consolidates all application observability signals onto a single platform (Grafana Cloud) to provide a single source of truth
- Provides technical leadership for the design of the "Paved Path" regarding application assurance and reliability signals
- Evaluates disparate observability tools and parallel support systems, and recommends consolidating them into unified, strategic solutions
- Designs integration strategies for identity and access management to ensure secure developer access to platform tools
- Participates in the development of automated reliability signals and self-service observability tools
- Drives project work and creates automation for the observability stack and application lifecycle tools
- Participates in peer reviews and technical integration efforts to ensure cross-functional alignment within the PTD and CPS organizations
- Sets standards for application assurance, including vulnerability management and identity integration programs
- Recommends frameworks for measuring platform performance, such as Kubernetes API server uptime and provisioning delivery time
- Articulates the vision for a unified runtime that leverages both global on-premises footprints and cloud capabilities
- Leads the Observability Stack Unification charter as part of the broader CI/CD and platform consolidation effort
- Utilizes FinOps and financial observability reporting to provide cost attribution by product, team, and organization
- Defines and publishes critical reliability metrics, including Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR)
- Provides L4 technical escalation capacity to stabilize critical, high-toil services
- Participates in on-call rotations for the relevant observability and operations areas to ensure 24/7 platform stability
- Serves as a technical liaison for internal product teams (the platform's customers) to understand concerns and priorities
- Acts as a primary point of contact for technical perspectives and alignment with stakeholders in the Equinix product organization and the Equinix IT organization
- Works with Engineering Managers to define platform KPIs and project schedules for unification efforts
- Provides status reporting on the Observability Standard and other strategic consolidation projects
- Investigates and evaluates new observability technologies to reduce infrastructure toil for product teams
- Influences the organization’s technical objectives by identifying high-value opportunities in areas like telemetry and proactive alerting
Requirements:
- 10+ years of experience in Platform Engineering, Site Reliability Engineering (SRE), or Observability-focused roles
- Bachelor's degree in Computer Science, Computer Engineering, or a related technical field
- Expert-level knowledge of Platform Engineering, Grafana Cloud, Observability concepts (Logs, Metrics, Traces, RUM, Synthetics, etc.), and Operational Readiness
- Competence with Kubernetes, ArgoCD, on-premises and cloud (AWS) infrastructure, and software engineering practices, including CI/CD
- Familiarity with Go development, cluster-api, and the CNCF ecosystem