Paxos is on a mission to open the world’s financial system to everyone by enabling the instant movement of any asset in a trustworthy way. As a Staff Site Reliability Engineer, you will lead the technical direction for the Platform Engineering team, focusing on the design, reliability, and scalability of cloud systems, while ensuring compliance and resilience.
Responsibilities:
- Architect, build, and operate resilient, scalable, and self-healing cloud infrastructure on AWS
- Lead the evolution of Kubernetes and platform services to enable secure, automated, and multi-region operations
- Define and enforce Infrastructure as Code (IaC) standards using Terraform, AWS CDK, and Crossplane to ensure consistency, security, and auditability
- Drive automation across provisioning, configuration, and monitoring pipelines to reduce manual effort and operational risk
- Establish and champion reliability, observability, and performance standards across Tier-1 services, ensuring alignment with regulatory and partner requirements
- Partner with product engineering to enhance CI/CD velocity, service resilience, and visibility through shared tooling, SLOs, and platform patterns
- Lead incident reviews, root-cause analyses, and systemic reliability improvements, embedding learnings into runbooks and design practices
- Optimize cloud infrastructure for cost, performance, and fault tolerance, using data to drive operational excellence
- Mentor and upskill engineers, shaping architectural direction and influencing design decisions across multiple teams
- Contribute to the technical strategy and roadmap for Paxos’ infrastructure platform, aligning platform scalability with business growth and compliance objectives
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field — or equivalent practical experience
- 8+ years of experience in Site Reliability Engineering, DevOps, or related infrastructure roles
- Deep expertise in public cloud platforms, especially AWS, with hands-on experience in services like EC2, S3, Lambda, CloudWatch, and IAM
- Strong proficiency with Kubernetes and container orchestration — you've run production workloads and understand cluster management, scaling, and troubleshooting
- Extensive experience with Infrastructure as Code (IaC) using tools such as Terraform, Pulumi, or Crossplane
- Solid scripting or programming skills in languages like Python, Bash, or Go, with a strong focus on automation
- Excellent problem-solving and debugging skills, with a systems-thinking mindset
- Strong communicator who thrives in collaborative, remote-first teams
- Working knowledge of managed database services like Amazon RDS or Aurora, or of PostgreSQL, is a plus, but infrastructure is your main game