Mount Sinai Health System, one of the largest academic medical systems in the New York metro area, is seeking a Site Reliability & Observability Engineer to manage and optimize its Dynatrace observability platform. The role is responsible for full-stack visibility across applications and infrastructure, and works with IT teams across the organization to enable proactive monitoring and automated issue detection.

Responsibilities:

Manage the end‑to‑end administration of the Dynatrace SaaS/Managed environment, including tenant management, security settings, tagging, and configuration policies
Deploy, upgrade, and maintain OneAgent, ActiveGate, and related components across hybrid and multi‑cloud environments
Maintain governance over dashboards, alerts, maintenance windows, management zones, and role‑based access controls (RBAC)
Define and maintain consistent tagging strategies to support service mapping, ownership visibility, and automated root‑cause detection
Develop and maintain high‑value dashboards for technical teams, leadership, and business stakeholders
Provide monthly/quarterly observability reports covering performance trends, risks, capacity insights, and optimization opportunities
Distill complex technical findings into clear executive‑friendly communication
Perform related duties as assigned or requested

Requirements:

Bachelor's degree in a technical discipline; Master's degree preferred
12-15 years of related experience preferred, including 8 years of demonstrated ability in the technology area. In-depth knowledge of associated technology areas that could affect the area of responsibility; healthcare technology experience preferred
3-5+ years of experience in application performance monitoring (APM), observability, or enterprise monitoring
Hands-on experience with Dynatrace administration (SaaS or Managed)
Strong understanding of cloud platforms (AWS, Azure, or GCP), Kubernetes, Linux/Windows systems, and networking fundamentals
Familiarity with logs/metrics/traces, synthetic monitoring, and distributed tracing concepts
Experience with automation and scripting (PowerShell, Python, Bash, YAML, Terraform preferred)
Ability to troubleshoot complex application, network, and infrastructure performance issues

Site Reliability & Observability Engineer (Technology Specialist IV) - Digital and Technology Partners - Remote