PanAgora Asset Management is focused on transforming financial lives and promoting a flexible work environment. The Senior Site Reliability Engineer will lead reliability initiatives, architect solutions for operational challenges, and mentor engineers to ensure high reliability of financial services infrastructure.

Responsibilities:

Design and implement highly available, fault-tolerant systems supporting critical financial transactions
Architect infrastructure solutions using AWS best practices, optimizing for cost, performance, and reliability
Lead complex incident response efforts, coordinating across teams to restore service rapidly
Drive postmortem processes for high-severity incidents, ensuring action items are identified and completed
Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services
Design and implement disaster recovery strategies and business continuity plans
Build advanced Infrastructure as Code solutions using Terraform, including modules, workspaces, and state management
Architect and optimize multi-cluster EKS environments, including pod autoscaling, cluster autoscaling, and resource optimization
Design observability strategies using Datadog and Splunk, including metrics, dashboards, and alerting that support proactive detection
Implement progressive delivery mechanisms (canary and blue-green deployments) within GitOps workflows
Build automation frameworks that reduce operational toil and improve team efficiency
Partner with development teams to improve application reliability, including design reviews and architectural guidance
Mentor junior and intermediate SREs through coaching and code reviews
Contribute to architectural decisions that impact platform reliability and scalability
Evangelize SRE best practices across the engineering organization
Participate in on-call rotations and drive improvements to reduce on-call burden
Implement and maintain zero-trust security controls across infrastructure
Ensure systems meet financial services regulatory requirements and internal compliance standards
Conduct security reviews of infrastructure changes and deployment processes
Participate in audit preparations and respond to compliance-related inquiries

Requirements:

Bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent experience
4 to 7 years of Site Reliability Engineering experience (or equivalent), with a track record operating large-scale production systems
Deep, hands-on expertise in AWS across a broad range of services and architectural patterns
Advanced Kubernetes knowledge, including custom resources, operators, and cluster federation concepts
Expert proficiency in Terraform, including module development, state management, and complex workflow orchestration
Strong programming skills in Python and/or Go, with ability to develop production-quality tools and services
Production experience implementing observability at scale using Datadog, Splunk, or similar platforms
Demonstrated experience establishing and maintaining CI/CD pipelines at enterprise scale
Deep understanding of GitOps principles and experience with tools such as ArgoCD or Flux
Proven ability to lead complex incident response and conduct thorough postmortems
Strong understanding of networking, security, and infrastructure design patterns
Experience mentoring engineers and conducting technical training
Experience in financial services or the payments industry
Deep knowledge of compliance frameworks (SOC 2, PCI DSS, FINRA)
AWS certifications (Solutions Architect Professional, DevOps Engineer Professional)
CKA and/or CKAD certifications
Experience with service mesh implementations (Istio, Linkerd, Consul)
Background in chaos engineering and fault injection testing
Experience with FinOps and cloud cost optimization
Contributions to open-source projects in the SRE/DevOps space
Experience implementing Operational Excellence strategies
