Paxos is on a mission to open the world’s financial system to everyone by enabling the instant movement of any asset in a trustworthy way. As a Staff Site Reliability Engineer, you will serve as a technical leader and architect within the Platform Engineering team, shaping the design, reliability, and scalability of Paxos’ next-generation infrastructure platform.
Responsibilities:
- Architect, build, and operate resilient, scalable, and self-healing cloud infrastructure on AWS
- Lead the evolution of Kubernetes and platform services to enable secure, automated, and multi-region operations
- Define and enforce Infrastructure as Code (IaC) standards using Terraform, AWS CDK, and Crossplane to ensure consistency, security, and auditability
- Drive automation across provisioning, configuration, and monitoring pipelines to reduce manual effort and operational risk
- Establish and champion reliability, observability, and performance standards across Tier-1 services, ensuring alignment with regulatory and partner requirements
- Partner with product engineering to enhance CI/CD velocity, service resilience, and visibility through shared tooling, SLOs, and platform patterns
- Lead incident reviews, root-cause analyses, and systemic reliability improvements, embedding learnings into runbooks and design practices
- Optimize cloud infrastructure for cost, performance, and fault tolerance, using data to guide operational decisions
- Mentor and upskill engineers, shaping architectural direction and influencing design decisions across multiple teams
- Contribute to the technical strategy and roadmap for Paxos’ infrastructure platform, aligning platform scalability with business growth and compliance objectives
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field — or equivalent practical experience
- 8+ years of experience in Site Reliability Engineering, DevOps, or related infrastructure roles
- Deep expertise in public cloud platforms, especially AWS, with hands-on experience in services like EC2, S3, Lambda, CloudWatch, and IAM
- Strong proficiency with Kubernetes and container orchestration: you've run production workloads and understand cluster management, scaling, and troubleshooting
- Extensive experience with Infrastructure as Code (IaC) using tools such as Terraform, Pulumi, or Crossplane
- Solid scripting or programming skills in languages like Python, Bash, or Go, with a strong focus on automation
- Excellent problem-solving and debugging skills, with a systems-thinking mindset
- Strong communicator who thrives in collaborative, remote-first teams
- Working knowledge of PostgreSQL and of managed database services like Amazon RDS or Aurora is a plus, but infrastructure is your main game