PanAgora Asset Management is focused on transforming financial lives and promoting a flexible work environment. The Senior Site Reliability Engineer will lead reliability initiatives, architect solutions to operational challenges, and mentor engineers to keep the financial services infrastructure highly reliable.
Responsibilities:
- Design and implement highly available, fault-tolerant systems supporting critical financial transactions
- Architect infrastructure solutions using AWS best practices, optimizing for cost, performance, and reliability
- Lead complex incident response efforts, coordinating across teams to restore service rapidly
- Drive postmortem processes for high-severity incidents, ensuring action items are identified and completed
- Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services
- Design and implement disaster recovery strategies and business continuity plans
- Build advanced Infrastructure as Code solutions using Terraform, including modules, workspaces, and state management
- Architect and optimize multi-cluster EKS environments, including pod autoscaling, cluster autoscaling, and resource optimization
- Design observability strategies using Datadog and Splunk, including metrics, dashboards, and alerting that support proactive detection
- Implement progressive delivery mechanisms (canary and blue-green deployments) within GitOps workflows
- Build automation frameworks that reduce operational toil and improve team efficiency
- Partner with development teams to improve application reliability, including design reviews and architectural guidance
- Mentor junior and intermediate SREs through coaching and code reviews
- Contribute to architectural decisions that impact platform reliability and scalability
- Evangelize SRE best practices across the engineering organization
- Participate in on-call rotations and drive improvements to reduce on-call burden
- Implement and maintain zero-trust security controls across infrastructure
- Ensure systems meet financial services regulatory requirements and internal compliance standards
- Conduct security reviews of infrastructure changes and deployment processes
- Participate in audit preparations and respond to compliance-related inquiries
Requirements:
- Bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent experience
- 4 to 7 years of Site Reliability Engineering experience (or equivalent), with a track record of operating large-scale production systems
- Deep, hands-on expertise in AWS across a broad range of services and architectural patterns
- Advanced Kubernetes knowledge, including custom resources, operators, and cluster federation concepts
- Expert proficiency in Terraform, including module development, state management, and complex workflow orchestration
- Strong programming skills in Python and/or Go, with the ability to develop production-quality tools and services
- Production experience implementing observability at scale using Datadog, Splunk, or similar platforms
- Demonstrated experience establishing and maintaining CI/CD pipelines at enterprise scale
- Deep understanding of GitOps principles and experience with tools such as ArgoCD or Flux
- Proven ability to lead complex incident response and conduct thorough postmortems
- Strong understanding of networking, security, and infrastructure design patterns
- Experience mentoring engineers and conducting technical training
- Experience in financial services or the payments industry
- Deep knowledge of compliance frameworks (SOC 2, PCI DSS, FINRA)
- AWS certifications (Solutions Architect Professional, DevOps Engineer Professional)
- CKA and/or CKAD certifications
- Experience with service mesh implementations (Istio, Linkerd, Consul)
- Background in chaos engineering and fault injection testing
- Experience with FinOps and cloud cost optimization
- Contributions to open-source projects in the SRE/DevOps space
- Experience implementing Operational Excellence strategies