OnePay is a consumer fintech trusted by millions of Americans to improve their financial experiences. As a Site Reliability Engineer, you will ensure the stability, scalability, and security of systems that support financial products, while driving reliability practices across teams.
Responsibilities:
- Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
- Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
- Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
- Partner closely with product and platform engineering teams to embed reliability best practices in design, development, and deployment processes
- Lead root cause analysis and postmortems, driving long-term improvements in resiliency and fault tolerance
Requirements:
- 5+ years of experience as a Software Engineer with a focus on building and running reliable, large-scale, distributed systems in production
- 5+ years of operational experience with observability tooling and libraries (metrics, logging, tracing), using Datadog or similar tools (Prometheus, Grafana)
- Proficiency in at least one programming language (Python, Go, Java, or Node.js preferred) for automation and tooling
- Proficiency in incident management, including participating in on-call rotations and writing postmortem reports
- Excellent collaboration skills with the ability to influence and educate product engineering teams on reliability and observability best practices
- Hands-on experience with cloud platforms (AWS preferred), container orchestration (Kubernetes), and infrastructure-as-code (IaC) tools (Terraform, Pulumi)
- Drive and proactivity – everyone here is a builder and executor