Zelis is modernizing the healthcare financial experience across payers, providers, and healthcare consumers. The company is seeking a Site Reliability Engineer to define and drive the observability roadmap across all platforms, with a focus on improving system reliability and operational efficiency.
Responsibilities:
- Define a unified vision for observability across all platforms, with golden signals as the foundation for monitoring and alerting
- Develop and maintain a comprehensive roadmap to improve observability, reduce tool redundancy, and standardize practices across platforms
- Establish and track key performance indicators (KPIs) to measure progress and ensure accountability for roadmap milestones
- Partner with the ZEIT SRE team and engineering leads to break down silos and promote consistent observability practices
- Drive cross-platform collaboration to reduce operational inconsistencies and define a 'north star' approach for observability
- Facilitate knowledge sharing to ensure alignment on current and future observability initiatives
- Standardize the implementation of golden signals across applications to improve system reliability and incident detection
- Optimize alerting tools and reduce redundant or ineffective monitoring interfaces ('panes of glass')
- Lead efforts to enhance observability while minimizing operational overhead for platform teams
- Maintain and enhance observability dashboards, delivering actionable insights into application health and performance
- Identify and address gaps in existing observability practices, prioritizing long-term scalability and reliability
- Collaborate with India-based team members to execute observability build-outs efficiently and with high quality
- Reduce issues raised by clients, providers, and print facilities through proactive monitoring and early detection
- Measure and report on observability success metrics, including actionable alert volume and reductions in issue escalations
- Continuously evaluate and refine observability strategies based on stakeholder feedback and evolving organizational needs
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience)
- Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or a related role with a strong focus on observability
- 5+ years of hands-on experience with .NET (C#), including advanced knowledge of ASP.NET Core, Web APIs, and performance optimization
- Demonstrated success in designing and implementing monitoring and alerting solutions across complex IT environments
- Deep understanding of SRE principles and golden signals for system monitoring
- Proficiency with observability tools such as Prometheus, Grafana, Splunk, New Relic, or Datadog
- Familiarity with cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes)
- Advanced proficiency in scripting languages such as PowerShell
- Experience in front-end development using React.js
- Strong leadership and collaboration abilities, with a proven ability to align diverse teams toward common goals
- Excellent analytical and problem-solving skills, with a proactive approach to identifying and resolving issues
- Clear and effective communication skills, capable of conveying technical concepts to stakeholders at all levels
- Experience with building observability roadmaps and scaling solutions in enterprise environments
- Certifications in cloud or DevOps-related disciplines (e.g., AWS Certified DevOps Engineer, Kubernetes Administrator)