Role: Site Reliability Engineering (SRE) Lead / Architect
Location: Phoenix, AZ - Onsite
We are seeking an experienced Site Reliability Engineering (SRE) Lead / Architect to design, build, and evolve highly available, scalable, and secure payment platforms. The role requires strong expertise across AWS cloud, enterprise middleware (IBM WebSphere, IBM MQ), modern application stacks, observability, and DevOps, along with a deep understanding of payments domain systems.
You will define SRE strategy, reliability architecture, and operational excellence while collaborating closely with application, infrastructure, security, and business teams.
Key Responsibilities
Reliability & Architecture
- Design and architect highly resilient, fault-tolerant payment systems that meet high-throughput, low-latency SLAs.
- Define SRE principles, including SLOs, SLIs, error budgets, and reliability KPIs for mission-critical payment services.
- Lead architecture decisions for cloud-native, hybrid, and legacy systems, including IBM WebSphere-based platforms.
- Drive active-active, disaster recovery (DR), and high availability (HA) strategies for AWS and on-prem integrations.
Cloud & Platform Engineering
- Architect and operate workloads on AWS (EC2, EKS/ECS, RDS, S3, IAM, VPC, CloudWatch).
- Optimize infrastructure for scalability, availability, security, and cost efficiency.
- Guide containerization and orchestration strategies where applicable.
Application & Middleware Expertise
- Partner with development teams on Java and Spring Boot-based microservices.
- Support the performance and reliability of frontend platforms built with React and Angular.
- Architect and operate messaging platforms using Kafka and IBM MQ.
- Manage enterprise middleware including IBM WebSphere Application Server.
DevOps & Automation
- Build and maintain CI/CD pipelines using Jenkins.
- Automate infrastructure provisioning, deployments, monitoring, and recovery processes.
- Promote Infrastructure as Code (IaC) and immutable infrastructure best practices.
- Champion DevOps and SRE culture across engineering teams.
Observability & Operations
- Design and standardize monitoring, logging, and alerting using:
  - Splunk
  - AWS CloudWatch
  - Datadog
- Implement proactive monitoring and advanced alerting for payment flows.
- Lead incident response, root cause analysis (RCA), and post-incident reviews.
- Drive reductions in MTTR and in the number of recurring incidents.
Database & Data Layer
- Architect and support PostgreSQL and Oracle databases with a focus on:
  - High availability
  - Performance tuning
  - Backup, restore, and disaster recovery
Payments Domain Leadership
- Provide reliability leadership for payment processing systems (authorization, capture, settlement, reconciliation).
- Ensure compliance with PCI DSS, security, and regulatory standards relevant to payments.
- Understand dependencies across gateways, processors, fraud, and downstream systems.
Leadership & Collaboration
- Act as technical lead/architect for SRE initiatives.
- Mentor SREs and engineers; guide best practices and standards.
- Work closely with product, architecture, security, and operations teams.
- Influence executive stakeholders on reliability, risk, and scalability decisions.