WEX is seeking a motivated mid-level Site Reliability Engineer (SRE) to enhance its reliability practices and tools. The role involves monitoring system health in the Azure Cloud, reducing operational toil through automation, and collaborating with development teams to implement reliability-focused features.
Responsibilities:
- Monitor and manage system health, availability, and performance of WEX’s Microsoft Azure Cloud ecosystem
- Actively identify and reduce 'toil' (manual, repetitive work) by developing and maintaining automation tools
- Participate in on-call rotations and respond to system alerts and incidents
- Collaborate with development teams to implement reliability-focused features
- Improve observability and logging for troubleshooting issues
- Follow IT security policies and compliance requirements
Requirements:
- 2+ years of experience in system administration, DevOps, or SRE roles
- Proficiency in scripting and automation using tools such as Python, Bash, Go, or Terraform
- Experience with monitoring and logging (Grafana, ELK stack, Splunk, etc.)
- Knowledge of containerization and orchestration (Docker, Kubernetes)
- Understanding of CI/CD pipelines and version control systems
- Understanding of monitoring tools such as Prometheus, Grafana, or Splunk
- Strong problem-solving skills and a willingness to learn
- Hands-on experience with the Microsoft Azure cloud platform
- Familiarity with infrastructure as code (Terraform, Ansible, CloudFormation)
- Knowledge of incident response processes and SLAs
- Experience developing AI-based solutions
- Ability to troubleshoot and resolve performance bottlenecks
- Strong communication skills and ability to work across teams
- Experience in healthcare, insurance, or benefits technology
- Experience working with compliance frameworks such as HIPAA, SOC 2, or HITRUST