Collaborate with Product, Customer Support, and Operations teams and serve as the technology first responder to efficiently resolve escalated issues
Coordinate with and train additional team members (including offshore team members) to ensure 24/7 support
Monitor production systems for performance, availability, and reliability
Respond to and manage production incidents: triage, perform root cause analysis, and help implement fixes to prevent recurrence
Develop, maintain, and optimize monitoring and alerting systems to proactively identify issues
Analyze system performance and identify areas for improvement
Collaborate with engineering teams to optimize application and infrastructure performance
Implement and manage capacity planning and scaling strategies
Automate repetitive tasks and processes to increase operational efficiency
Develop and maintain scripts and tools for system management and monitoring
Implement and maintain SRE principles, including service level objectives (SLOs), service level indicators (SLIs), and error budgets
Participate in the design and implementation of disaster recovery and high-availability solutions
Maintain comprehensive documentation for systems, processes, and incident management procedures
Create and deliver regular reports on system health, incident trends, and performance metrics
Work closely with development teams and SaaS vendors’ product teams to ensure new features and changes are reliable and scalable
Partner with SaaS vendors to ensure timely resolution of platform issues, enforcing SLAs for urgent issues and monitoring backlog remediation for items with lower priority or longer lead times
Communicate effectively with stakeholders to provide updates on incident resolution and system status
Stay current with industry trends, technologies, and best practices in SRE and production support
Propose and implement improvements to processes, tools, and systems
Perform other duties as assigned
Requirements
High School Diploma required; Bachelor’s Degree preferred
Minimum of 3 years of experience in engineering, production support, system administration, or related roles required
Experience performing support in a banking environment a plus
Strong understanding of SRE principles and practices
Scripting: Basic ability to read and modify Bash or Python scripts to automate routine tasks like log rotation or system health checks
Networking: Foundational knowledge of TCP/IP, DNS, and SSH to diagnose connectivity issues
Monitoring Basics: Familiarity with viewing dashboards in Grafana and responding to automated alerts
SQL & Data: Proficiency in writing basic SQL queries (SELECT, JOIN, WHERE) to extract data for troubleshooting or reporting
API Diagnostics: Basic experience using Postman to test endpoints and verify API responses
Financial Literacy: Understanding of banking operational discipline and why accuracy is critical in ACH and file movements
Incident Management: Understanding the lifecycle of a ticket in tools like Jira or ServiceNow
Documentation: High attention to detail in following and updating playbooks/runbooks
Version Control: Basic knowledge of Git for checking out code or configurations
Cloud & Containers: Conceptual understanding of AWS and Kubernetes (e.g., checking pod status)
Communication: Ability to translate technical issues into clear status updates for stakeholders during an outage
Critical Incident Response: Ability to remain calm under pressure, prioritize high-severity tickets, and communicate technical ETAs to non-technical stakeholders
Root Cause Analysis (RCA): Systematic approach to problem-solving (e.g., Five Whys, 8D) to ensure permanent fixes rather than temporary workarounds
Self-Starter Mentality: Ability to work independently in a Follow-the-Sun support model, participating in on-call rotations and weekend releases
Must be able to work shift hours (earliest 7 am PST, and latest 8 pm PST) with rotating weekly on-call shifts
Proficiency with Microsoft Office tools (Outlook, Word, PowerPoint, Excel)
Excellent verbal, written, and interpersonal communication skills
Strong organizational skills and attention to detail
Outstanding problem-solving and time management skills
Self-motivated, self-directed, and results-oriented
Adaptable and able to multitask in a fast-paced environment
Can work independently and within a team; solution-oriented with a collaborative approach
Tech Stack
AWS
Cloud
DNS
Grafana
Kubernetes
Linux
Python
ServiceNow
SQL
TCP/IP
Unix
Benefits
Comprehensive health, dental, and vision plans
4 weeks PTO
401k + company match
Metro SmarTrip benefits ($50/mo)
Remote or hybrid work schedules for most positions
Incentives for purchasing solar panels, electric vehicles, biking to work, etc.
Paid subscriptions to Veterans Compost and Capital Bikeshare, Imperfect Foods reimbursement, and more!
Best Workplaces for Commuters 2023 & 2024 winner
The Washington Post Top Workplaces 2023, 2024, and 2025 winner
American Banker Best Banks to Work For 2023 winner