WEX is seeking a highly motivated Site Reliability Engineer to join their team during a pivotal time in their SRE evolution. The role involves monitoring and managing system health in the Azure Cloud ecosystem, developing automation tools to reduce operational toil, and collaborating with development teams to enhance reliability practices.
Responsibilities:
- Monitor and manage system health, availability, and performance of WEX’s Microsoft Azure Cloud ecosystem
- Actively identify and reduce “toil” (manual, repetitive work) by developing and maintaining automation tools
- Participate in on-call rotations and respond to system alerts and incidents
- Collaborate with development teams to implement reliability-focused features
- Improve observability and logging for troubleshooting issues
- Follow IT security policies and compliance requirements
Requirements:
- 2+ years of experience in system administration, DevOps, or SRE roles
- Proficiency in scripting and automation using Python, Bash, Go, or Terraform
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack, Splunk, etc.)
- Knowledge of containerization and orchestration (Docker, Kubernetes)
- Understanding of CI/CD pipelines and version control systems
- Strong problem-solving skills and a willingness to learn
- Hands-on experience with Azure cloud platforms
- Familiarity with infrastructure as code (Terraform, Ansible, CloudFormation)
- Knowledge of incident response processes and SLAs
- Experience developing AI-based solutions
- Ability to troubleshoot and resolve performance bottlenecks
- Strong communication skills and ability to work across teams
- Experience in healthcare, insurance, or benefits technology
- Experience working with compliance frameworks such as HIPAA, SOC 2, or HITRUST