Longbridge is a fast-growing online brokerage platform on a mission to make investing smarter, simpler, and more accessible for everyone. They are seeking a hands-on Site Reliability Engineer to design, scale, and safeguard the reliability of their next-generation financial platforms, partnering closely with product and engineering teams globally.

Responsibilities:

Own system reliability: Design, implement, and operate highly available, secure distributed systems to meet strict uptime and performance targets
Build automation at scale: Develop and enforce best practices in monitoring, alerting, and infrastructure-as-code (e.g., Terraform, Ansible, Helm)
Partner globally: Work with development teams from design through deployment, ensuring reliability and resiliency are built in from day one
Lead incident response: Drive on-call processes, conduct root-cause analysis, and continuously reduce MTTR and failure recurrence
Future-proof our stack: Evaluate and adopt modern cloud-native technologies (e.g., Kubernetes, Prometheus, AWS/GCP) to keep systems secure and scalable
Stress-test and safeguard: Lead disaster recovery, chaos testing, and capacity planning for critical wealth management services

Requirements:

5+ years of experience in SRE, DevOps, or production engineering roles
Strong background in AWS (or GCP/Azure) and container orchestration (Docker, Kubernetes)
Proficiency in at least one programming language (Python, Go, or similar) for automation and tooling
Solid Linux administration skills and experience with CI/CD pipelines
Proven ability in incident management and troubleshooting distributed systems
Strong collaboration and communication skills across remote/global teams
Comfortable working in a fast-moving fintech/tech startup environment
Experience supporting regulated financial systems
Ability to communicate in Mandarin to collaborate with Asia-based colleagues
