OnePay is a consumer fintech trusted by millions of Americans to improve their financial experiences. As a Site Reliability Engineer, you will ensure the stability, scalability, and security of systems that support financial products, while driving reliability practices across teams.

Responsibilities:

Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
Partner closely with product and platform engineering teams to embed reliability best practices in design, development, and deployment processes
Lead root cause analysis and postmortems, driving long-term improvements in resiliency and fault tolerance

Requirements:

5+ years of experience as a Software Engineer with a focus on building and running reliable, large-scale, distributed systems in production
5+ years of operational experience with observability tooling and libraries (metrics, logging, tracing), including Datadog or similar tools (Prometheus, Grafana)
Proficiency in at least one programming language (Python, Go, Java, or Node.js preferred) for automation and tooling
Proficiency in incident management, including participating in on-call rotations and writing postmortem reports
Excellent collaboration skills with the ability to influence and educate product engineering teams on reliability and observability best practices
Hands-on experience with cloud platforms (AWS preferred), container orchestration (Kubernetes), and infrastructure-as-code (IaC) tools (Terraform, Pulumi)
Drive and proactivity – everyone here is a builder and executor

Software Engineer - SRE
