Empower is focused on transforming financial lives and offers a flexible work environment and opportunities for career growth. The company is seeking an API Reliability Engineer to build and operate reliable, scalable API services, troubleshoot complex issues, and improve system resilience.
Responsibilities:
- Own and improve the reliability, performance, and scalability of API services in production
- Troubleshoot and resolve P1/P2 production incidents end-to-end, analyzing issues across application, infrastructure, and integrations
- Work closely with API developers to identify and address reliability issues and application-level security vulnerabilities in service design and implementation
- Contribute targeted code-level or configuration fixes to resolve issues and prevent recurrence
- Participate in root cause analysis (RCA) and drive durable, long-term fixes
- Improve API resilience through patterns such as timeouts, retries, circuit breakers, and graceful degradation
- Establish and enhance observability and service health monitoring, including logs, metrics, traces, and SLOs, using Datadog and Splunk
- Define and monitor SLAs/SLOs for API performance and availability
- Work with API Gateway and ALB/NLB for traffic management, routing, and system reliability
- Contribute to CI/CD pipelines using Jenkins to ensure safe and consistent deployments
- Contribute to disaster recovery readiness and system resilience planning
- Collaborate across engineering teams to improve system design and operational readiness
- Participate in an on-call rotation for critical incidents (P1/P2)
Requirements:
- Minimum 5 years of experience in backend or API development
- Strong hands-on experience with Java and Spring Boot
- Proven experience building, shipping, and operating APIs in production environments
- Strong problem-solving skills with the ability to debug real production issues end-to-end
- Experience handling P1/P2 incidents in production environments
- Solid understanding of API architecture, request lifecycle, and common failure patterns
- Experience with AWS services such as API Gateway, ALB/NLB, EC2, ECS/EKS, Lambda, RDS, and DynamoDB
- Familiarity with reliability patterns such as timeouts, retries, circuit breakers, and connection pooling
- Experience with observability tools such as Datadog and/or Splunk
- Experience with CI/CD pipelines, preferably Jenkins
- Strong debugging skills in distributed systems
- Experience with Git-based workflows and Agile development
- Bachelor's degree in Computer Science, Information Systems, or a related field; equivalent practical experience welcomed
- AWS certifications such as Solutions Architect or Developer Associate
- Experience with microservices and distributed system design
- Exposure to SLAs/SLOs and service health metrics
- Experience with Docker and Kubernetes
- Familiarity with API gateways, traffic routing, and load balancing strategies
- Experience in performance tuning and scalability improvements
- Strong communication skills during high-severity incidents