Act as the technical owner of the Platform Squad, defining, driving, and enforcing platform standards across the full lifecycle (design, rollout, upgrades, and decommissioning) for cloud infrastructure, Kubernetes, and service mesh.
Ensure platform components are designed and operated according to SRE principles, focusing on reliability, scalability, and operational simplicity.
Drive architectural decisions with a sustainable platform vision, balancing innovation, security, and operational stability.
Define, build, and continuously improve operational processes for internal and external consumers, including platform onboarding and adoption; change management and release processes; and incident, problem, and escalation management.
Act as a point of escalation for complex platform incidents and reliability risks, participating in on-call rotations as needed.
Ensure platform operations comply with internal controls, audit requirements, and security standards.
Establish and own platform observability standards, ensuring consistent implementation of the Golden Signals: latency, traffic, errors, and saturation.
Define and track platform SLIs, SLOs, and error budgets in partnership with internal consumers.
Use metrics and operational data to drive prioritization, reliability improvements, and capacity planning decisions.
Foster a collaborative, servant-leadership culture that enables squads to self-serve while maintaining guardrails.
Collaborate closely with application engineering teams, other SRE squads, and stakeholders across security, compliance, and architecture.
Promote knowledge sharing through strong documentation and enablement around platform usage and best practices.
Provide technical mentorship and guidance to platform engineers, supporting engineering excellence and growth.
Support the Squad Manager in planning, prioritization, and execution of platform initiatives.
Ensure work is visible, well-documented, and aligned with broader SRE and company objectives.
Requirements
Proven experience in Platform Engineering and/or SRE roles, with demonstrated technical leadership.
Strong hands-on experience with public cloud platforms (AWS preferred; Azure is a plus).
Strong experience operating Kubernetes at scale (EKS or equivalent).
Experience with Service Mesh technologies (Istio preferred; App Mesh, Linkerd, etc. are a plus).
Solid understanding of SRE fundamentals, including SLIs/SLOs, error budgets, and reliability-driven prioritization.
Strong experience with observability tooling and practices, including metrics, logging, tracing, alerting, and Golden Signals.
Strong incident management and on-call operations experience, including escalation and problem management.
Experience with Infrastructure as Code (e.g., Terraform) and cloud-native operational patterns.
Strong understanding of cloud-native microservices architecture and platform enablement patterns.
Ability to translate complex technical concepts into clear guidance for non-platform teams.
Excellent collaboration, communication, and stakeholder management skills.
Tech Stack
AWS
Azure
Cloud
Kubernetes
Microservices
Terraform
Benefits
Competitive salary
Health insurance
Flexible work arrangements
Professional development
Manager, Site Reliability Engineer – Platform at Visa | JobVerse