Availity is a healthcare technology company that delivers revenue cycle and related business solutions for healthcare professionals. As a Platform Engineer IV, you will lead observability practices and ensure the health and stability of critical infrastructure systems that support U.S. healthcare services.
Responsibilities:
- Own and evolve our observability practices at enterprise scale
- Lead the tooling and support for observability and logging services (New Relic, Splunk, Cribl, OpenTelemetry) with reliability as your north star
- Oversee the delivery of automated deployment solutions for observability tools, along with governance that ensures observability coverage for mission-critical internal platforms in our AWS private cloud
- Guide and mentor engineers while setting the bar for operational excellence
- Provide technical leadership for the infrastructure engineering and operations team focused on observability services
- Own and advance the observability practices implemented across all enterprise technology groups
- Manage observability and logging platforms, including:
  - Splunk (EKS and on-prem components, forwarders, deployment server)
  - Cribl operational pipelines (EKS-based)
  - New Relic SaaS integrations and Prometheus data ingestion
  - OpenTelemetry and KubeLogging/Banzai Operator for distributed tracing and logging pipelines
  - Prometheus/Grafana migrations from on-prem OCP to AWS for metrics scraping and synthetic monitoring
- Oversee observability deployment solutions for platforms hosted in AWS
- Drive infrastructure-as-code practices (Terraform, Helm, Ansible) for repeatable deployments and environment consistency
- Collaborate with engineering, middleware, and product teams to define clear ownership, reduce friction, and ensure platform services enable, rather than block, delivery
- Plan and execute upgrades, patching, and platform updates proactively and without business disruption
- Set reliability targets and define operational metrics (availability, latency, error budgets) in line with SRE methodologies
Requirements:
- Bachelor's degree in computer science or related field, or equivalent work experience
- 7-10 years of relevant technical and business experience in IT systems delivery, operations, and support (preferably in healthcare or high-transaction environments)
- 3+ years of experience leading technical engineering efforts involving implementation and management of IT systems
- Hands-on expertise leading observability practices and architecture across the enterprise and at scale
- Hands-on experience managing observability platforms and monitoring tools: Splunk, Cribl, Prometheus/Grafana, OpenTelemetry, New Relic
- Hands-on experience with Terraform, Helm, Istio, and AWS services (VPC, IAM, EC2, EKS)
- Experience bridging infrastructure and development teams, ensuring alignment of roadmaps and goals
- Strong leadership skills with the ability to motivate and guide technical teams
- Excellent communication skills, with the ability to explain complex technical concepts to both technical and non-technical stakeholders
- SaaS experience supporting large-scale, mission-critical systems
- Familiarity with IaC application deployment pipelines for packaged software (commercial and open source) and re-platforming to cloud-native environments
- Knowledge of service mesh concepts (Istio, Linkerd, etc.)
- Background in metrics-driven reliability engineering (SLOs, SLIs, error budgets)
- Experience with scripting/programming (JavaScript for Cribl, Python, etc.)