Paxos is on a mission to open the world’s financial system to everyone by enabling the instant movement of any asset in a trustworthy way. As a Staff Site Reliability Engineer, you will lead the technical direction for the Platform Engineering team, focusing on the design, reliability, and scalability of cloud systems, while ensuring compliance and resilience.

Responsibilities:

Architect, build, and operate resilient, scalable, and self-healing cloud infrastructure on AWS
Lead the evolution of Kubernetes and platform services to enable secure, automated, and multi-region operations
Define and enforce Infrastructure as Code (IaC) standards using Terraform, AWS CDK, and Crossplane to ensure consistency, security, and auditability
Drive automation across provisioning, configuration, and monitoring pipelines to reduce manual effort and operational risk
Establish and champion reliability, observability, and performance standards across Tier-1 services, ensuring alignment with regulatory and partner requirements
Partner with product engineering to enhance CI/CD velocity, service resilience, and visibility through shared tooling, SLOs, and platform patterns
Lead incident reviews, root-cause analyses, and systemic reliability improvements, embedding learnings into runbooks and design practices
Optimize cloud infrastructure for cost, performance, and fault tolerance, using data to drive operational excellence
Mentor and upskill engineers, shaping architectural direction and influencing design decisions across multiple teams
Contribute to the technical strategy and roadmap for Paxos’ infrastructure platform, aligning platform scalability with business growth and compliance objectives

Requirements:

Bachelor's degree in Computer Science, Information Technology, or a related field — or equivalent practical experience
8+ years of experience in Site Reliability Engineering, DevOps, or related infrastructure roles
Deep expertise in public cloud platforms, especially AWS, with hands-on experience in services like EC2, S3, Lambda, CloudWatch, and IAM
Strong proficiency with Kubernetes and container orchestration — you've run production workloads and understand cluster management, scaling, and troubleshooting
Extensive experience with Infrastructure as Code (IaC) using tools such as Terraform, Pulumi, or Crossplane
Solid scripting or programming skills in languages like Python, Bash, or Go, with a strong focus on automation
Excellent problem-solving and debugging skills, with a systems-thinking mindset
Strong communicator who thrives in collaborative, remote-first teams
Working knowledge of managed database services such as Amazon RDS or Aurora, and of databases like PostgreSQL, is a plus, but infrastructure is your main game

Staff Site Reliability Engineer, Platform Engineering