Zelis is modernizing the healthcare financial experience for payers, providers, and healthcare consumers. The company is seeking a Site Reliability Engineer to define and drive the observability roadmap across all platforms, with a focus on improving system reliability and operational efficiency.

Responsibilities:

Define a unified vision for observability across all platforms, with golden signals as the foundation for monitoring and alerting
Develop and maintain a comprehensive roadmap to improve observability, reduce tool redundancy, and standardize practices across platforms
Establish and track key performance indicators (KPIs) to measure progress and ensure accountability for roadmap milestones
Partner with the ZEIT SRE team and engineering leads to break down silos and promote consistent observability practices
Drive cross-platform collaboration to reduce operational inconsistencies and define a 'north star' approach for observability
Facilitate knowledge sharing to ensure alignment on current and future observability initiatives
Standardize the implementation of golden signals across applications to improve system reliability and incident detection
Optimize alerting tools and reduce redundant or ineffective monitoring interfaces ('panes of glass')
Lead efforts to enhance observability while minimizing operational overhead for platform teams
Maintain and enhance observability dashboards, delivering actionable insights into application health and performance
Identify and address gaps in existing observability practices, prioritizing long-term scalability and reliability
Collaborate with India-based resources to execute observability build-outs efficiently and with high quality
Reduce issues raised by clients, providers, and print facilities through proactive monitoring and early detection
Measure and report on observability success metrics, including actionable alert volume and reduced issue escalations
Continuously evaluate and refine observability strategies based on stakeholder feedback and evolving organizational needs

Requirements:

Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience)
Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or a related role with a strong focus on observability
5+ years of hands-on experience with .NET (C#), including advanced knowledge of ASP.NET Core, Web APIs, and performance optimization
Demonstrated success in designing and implementing monitoring and alerting solutions across complex IT environments
Deep understanding of SRE principles and golden signals for system monitoring
Proficiency with observability tools such as Prometheus, Grafana, Splunk, New Relic, or Datadog
Familiarity with cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes)
Advanced proficiency in scripting languages such as PowerShell
Experience in front-end development using React.js
Strong leadership and collaboration skills, with a proven ability to align diverse teams toward common goals
Excellent analytical and problem-solving skills, with a proactive approach to identifying and resolving issues
Clear and effective communication skills, capable of conveying technical concepts to stakeholders at all levels
Experience with building observability roadmaps and scaling solutions in enterprise environments
Certifications in cloud or DevOps-related disciplines (e.g., AWS Certified DevOps Engineer, Kubernetes Administrator)
