PanAgora Asset Management is committed to transforming financial lives through a flexible and inclusive work environment. The firm is seeking a Site Reliability Engineer (SRE) to ensure the reliability and operational excellence of its AWS-based data platform, applying core SRE principles to strengthen data infrastructure support.
Responsibilities:
- Own and improve the reliability, stability, scalability, and performance of our core data platforms and services
- Provide operational support for large-scale, distributed data systems, ensuring high availability and strong SLAs
- Partner closely with full-stack, data, and platform engineering teams to deliver continuous improvements
- Operate and support EMR and EMR Serverless (Python/Spark) workloads and data pipelines
- Support and optimize Amazon Redshift and DynamoDB in high-throughput, production environments
- Design, build, and evolve monitoring, alerting, and observability frameworks with a focus on symptoms, not just outages
- Lead incident response: troubleshoot production issues across the full stack and coordinate with internal and external stakeholders
- Perform root cause analysis (RCA) and readiness reviews; turn findings into durable fixes and automation
- Create and maintain runbooks, SOPs, and operational documentation
- Collaborate with engineering teams to optimize performance, reliability, and cost
- Participate in an on-call rotation to respond to incidents impacting customer-facing systems
- Recommend and influence the use of AWS managed services and architectural patterns
- Continuously evaluate system performance, capacity, and cost to scale efficiently
Requirements:
- 4–6 years of experience building or operating systems across multiple architecture domains: application, data, integration, infrastructure, and security
- 4+ years of hands-on AWS experience, with strong production exposure to several of the following: Redshift, DynamoDB, EMR, EMR Serverless, EC2, S3
- Proven experience operating data platforms such as data lakes and data warehouses in production
- Strong SQL skills and experience working with modern databases (e.g., Redshift, DynamoDB, Postgres, MySQL, Oracle)
- 4+ years of Python experience, including scripting, automation, or data workloads
- Experience with CloudWatch, infrastructure monitoring, and alerting
- Hands-on experience with incident management, uptime SLAs, and customer-impacting systems
- Strong understanding of Git-based workflows (GitHub, Git Flow, or similar)
- Experience working in Agile environments (Scrum / Kanban) using tools such as Jira and Confluence
- Bachelor's degree in Computer Science, Information Systems, Data/Analytics, or a related field; equivalent practical experience welcomed
Preferred Qualifications:
- Experience with Terraform or other Infrastructure-as-Code tools
- Exposure to Snowflake or experience supporting analytics platforms beyond Redshift
- Experience in financial services or other highly regulated environments
- Knowledge of DevOps and CI/CD best practices
- Familiarity with observability tools such as Splunk, AppDynamics, or advanced CloudWatch usage
- Comfortable working across Linux/Unix environments
- Strong communication skills during incident response with both technical and non-technical stakeholders
- Security-minded approach to building secure, reliable, and durable systems
- Willingness to support occasional off-hours or weekend incidents as part of on-call responsibilities
- Experience with streaming/event pipelines (Kafka/Kinesis), CDC patterns, and backfill strategies
- Experience with OpenLineage/Marquez and catalog integrations (Collibra/Alation/Purview)
- Prior FinOps or capacity-planning ownership for data platforms
- Familiarity with BI semantic layers and contract enforcement at consumption (Looker/Power BI/Tableau)