Mount Sinai Health System, one of the largest academic medical systems in the New York metro area, is seeking a Site Reliability & Observability Engineer to manage and optimize its Dynatrace observability platform. The role ensures full-stack visibility across applications and infrastructure and works with IT teams across the organization to enable proactive monitoring and automated issue detection.
Responsibilities:
- Manage the end‑to‑end administration of the Dynatrace SaaS/Managed environment, including tenant management, security settings, tagging, and configuration policies
- Deploy, upgrade, and maintain OneAgent, ActiveGate, and related components across hybrid and multi‑cloud environments
- Maintain governance over dashboards, alerts, maintenance windows, management zones, and role‑based access controls (RBAC)
- Define and maintain consistent tagging strategies to support service mapping, ownership visibility, and automated root‑cause detection
- Develop and maintain high‑value dashboards for technical teams, leadership, and business stakeholders
- Provide monthly/quarterly observability reports covering performance trends, risks, capacity insights, and optimization opportunities
- Distill complex technical findings into clear executive‑friendly communication
- Perform related duties as assigned or requested
Requirements:
- Bachelor's degree in a technical discipline; Master's degree preferred
- 12-15 years of related experience preferred, including 8 years of demonstrated ability in the relevant technology area; in-depth knowledge of associated technology areas that could affect the area of responsibility; healthcare technology experience preferred
- 3-5+ years of experience in application performance monitoring (APM), observability, or enterprise monitoring
- Hands-on experience with Dynatrace administration (SaaS or Managed)
- Strong understanding of cloud platforms (AWS, Azure, or GCP), Kubernetes, Linux/Windows systems, and networking fundamentals
- Familiarity with logs/metrics/traces, synthetic monitoring, and distributed tracing concepts
- Experience with automation and scripting (PowerShell, Python, Bash, YAML; Terraform preferred)
- Ability to troubleshoot complex application, network, and infrastructure performance issues