PanAgora Asset Management is committed to transforming financial lives through a flexible and inclusive work environment. The firm is seeking a Site Reliability Engineer (SRE) to ensure the reliability and operational excellence of its AWS-based data platform, applying core SRE principles to strengthen data infrastructure support.
Responsibilities:
- Own and improve the reliability, stability, scalability, and performance of our core data platforms and services
- Provide operational support for large-scale, distributed data systems, ensuring high availability and strong SLAs
- Partner closely with full-stack, data, and platform engineering teams to deliver continuous improvements
- Operate and support EMR and EMR Serverless (Python/Spark) workloads and data pipelines
- Support and optimize Amazon Redshift and DynamoDB in high-throughput, production environments
- Design, build, and evolve monitoring, alerting, and observability frameworks with a focus on symptoms, not just outages
- Lead incident response: troubleshoot production issues across the full stack and coordinate with internal and external stakeholders
- Perform root cause analysis (RCA) and readiness reviews; turn findings into durable fixes and automation
- Create and maintain runbooks, SOPs, and operational documentation
- Collaborate with engineering teams to optimize performance, reliability, and cost
- Participate in an on-call rotation to respond to incidents impacting customer-facing systems
- Recommend and influence the use of AWS managed services and architectural patterns
- Continuously evaluate system performance, capacity, and cost to scale efficiently
Requirements:
- 4–6 years of experience building or operating systems across multiple architecture domains: application, data, integration, infrastructure, and security
- 4+ years of hands-on AWS experience, with strong production exposure to several of the following: Redshift, DynamoDB, EMR, EMR Serverless, EC2, S3
- Proven experience operating data platforms such as data lakes and data warehouses in production
- Strong SQL skills and experience working with modern databases (e.g., Redshift, DynamoDB, Postgres, MySQL, Oracle)
- 4+ years of Python experience, including scripting, automation, or data workloads
- Experience with CloudWatch, infrastructure monitoring, and alerting
- Hands-on experience with incident management, uptime SLAs, and customer-impacting systems
- Strong understanding of Git-based workflows (GitHub, Git Flow, or similar)
- Experience working in Agile environments (Scrum / Kanban) using tools such as Jira and Confluence
- Bachelor's degree in Computer Science, Information Systems, Data/Analytics, or a related field; equivalent practical experience welcomed
Preferred Qualifications:
- Experience with Terraform or other Infrastructure-as-Code tools
- Exposure to Snowflake or experience supporting analytics platforms beyond Redshift
- Experience in financial services or other highly regulated environments
- Knowledge of DevOps and CI/CD best practices
- Familiarity with observability tools such as Splunk, AppDynamics, or advanced CloudWatch usage
- Comfortable working across Linux/Unix environments
- Strong communication skills during incident response with both technical and non-technical stakeholders
- Security-minded approach to building secure, reliable, and durable systems
- Willingness to support occasional off-hours or weekend incidents as part of on-call responsibilities
- Experience with streaming/event pipelines (Kafka/Kinesis), CDC patterns, and backfill strategies
- Experience with OpenLineage/Marquez and catalog integrations (Collibra/Alation/Purview)
- Prior FinOps or capacity-planning ownership for data platforms
- Familiarity with BI semantic layers and contract enforcement at consumption (Looker/Power BI/Tableau)