Role: Site Reliability Engineering (SRE) Lead / Architect
Location: Phoenix, AZ - Onsite

We are seeking an experienced Site Reliability Engineering (SRE) Lead / Architect to design, build, and evolve highly available, scalable, and secure payment platforms. The role requires strong expertise across AWS cloud, enterprise middleware (IBM WebSphere, IBM MQ), modern application stacks, observability, and DevOps, along with a deep understanding of payments domain systems.

You will define SRE strategy, reliability architecture, and operational excellence while collaborating closely with application, infrastructure, security, and business teams.

Key Responsibilities

Reliability & Architecture

Design and architect highly resilient, fault-tolerant payment systems that meet high-throughput, low-latency SLAs.
Define SRE principles, including SLOs, SLIs, error budgets, and reliability KPIs for mission-critical payment services.
Lead architecture decisions for cloud-native, hybrid, and legacy systems, including IBM WebSphere-based platforms.
Drive active-active, disaster recovery (DR), and high availability (HA) strategies for AWS and on-prem integrations.

Cloud & Platform Engineering

Architect and operate workloads on AWS (EC2, EKS/ECS, RDS, S3, IAM, VPC, CloudWatch).
Optimize infrastructure for scalability, availability, security, and cost efficiency.
Guide containerization and orchestration strategies where applicable.

Application & Middleware Expertise

Partner with development teams on Java and Spring Boot-based microservices.
Support the performance and reliability of frontend platforms built with React and Angular.
Architect and operate messaging platforms using Kafka and IBM MQ.
Manage enterprise middleware including IBM WebSphere Application Server.

DevOps & Automation

Build and maintain CI/CD pipelines using Jenkins.
Automate infrastructure provisioning, deployments, monitoring, and recovery processes.
Promote Infrastructure as Code (IaC) and immutable infrastructure best practices.
Champion DevOps and SRE culture across engineering teams.

Observability & Operations

Design and standardize monitoring, logging, and alerting using:

Splunk
AWS CloudWatch
Datadog

Implement proactive monitoring and advanced alerting for payment flows.
Lead incident response, root cause analysis (RCA), and post-incident reviews.
Drive down MTTR and the number of recurring incidents.

Database & Data Layer

Architect and support PostgreSQL and Oracle databases with a focus on:

High availability
Performance tuning
Backup, restore, and disaster recovery

Payments Domain Leadership

Provide reliability leadership for payment processing systems (authorization, capture, settlement, reconciliation).
Ensure compliance with PCI DSS, security, and regulatory standards relevant to payments.
Understand dependencies across gateways, processors, fraud, and downstream systems.

Leadership & Collaboration

Act as technical lead/architect for SRE initiatives.
Mentor SREs and engineers; guide best practices and standards.
Work closely with product, architecture, security, and operations teams.
Influence executive stakeholders on reliability, risk, and scalability decisions.
