Longbridge is a fast-growing online brokerage platform on a mission to make investing smarter, simpler, and more accessible for everyone. They are seeking a hands-on Site Reliability Engineer to design, scale, and safeguard the reliability of their next-generation financial platforms, partnering closely with product and engineering teams globally.
Responsibilities:
- Own system reliability: Design, implement, and operate highly available, secure distributed systems to meet strict uptime and performance targets
- Build automation at scale: Develop and enforce best practices in monitoring, alerting, and infrastructure-as-code (e.g., Terraform, Ansible, Helm)
- Partner globally: Work with development teams from design through deployment, ensuring reliability and resiliency are built in from day one
- Lead incident response: Drive on-call processes, conduct root-cause analysis, and continuously reduce MTTR and failure recurrence
- Future-proof the stack: Evaluate and adopt modern cloud-native technologies (e.g., Kubernetes, Prometheus, AWS/GCP) to keep systems secure and scalable
- Stress-test and safeguard: Lead disaster recovery, chaos testing, and capacity planning for critical wealth management services
Requirements:
- 5+ years of experience in SRE, DevOps, or production engineering roles
- Strong background in AWS (or GCP/Azure) and container orchestration (Docker, Kubernetes)
- Proficiency in at least one programming language (Python, Go, or similar) for automation and tooling
- Solid Linux administration skills and experience with CI/CD pipelines
- Proven ability in incident management and troubleshooting distributed systems
- Strong collaboration and communication skills across remote/global teams
- Comfortable working in a fast-moving fintech/tech startup environment
- Experience supporting regulated financial systems
- Ability to communicate in Mandarin to collaborate with Asia-based colleagues