Artisight transforms hospital operations with its Smart Hospital Platform, helping health systems reduce costs, improve efficiency, and enhance patient care. They are seeking a Senior Site Reliability Engineer to architect healthcare technology systems that impact patient care, focusing on reliability and resilience. The role involves creating the SRE team and optimizing infrastructure for performance and efficiency.

Responsibilities:

Serve as the go-to expert for complex L2 support issues, diving deep into our stack to not just fix problems but eliminate their root causes forever
Engineer automation solutions that turn repetitive operational tasks into seamless, intelligent workflows, because manual work should be the exception, not the rule
Design and implement next-generation observability platforms that provide crystal-clear insights into system health before problems become incidents
Partner with development teams to bake scalability, reliability, and security directly into new features and architectural decisions from day one
Lead incident response with surgical precision, conducting thorough post-mortems that transform failures into learning opportunities
Mentor emerging talent across engineering teams, spreading the SRE mindset and elevating our collective technical capabilities
Hunt down performance bottlenecks across our infrastructure and applications, optimizing for speed and efficiency at scale

Requirements:

Expert-level Python (scripting, automation, tooling)
Linux proficiency (Ubuntu preferred): system administration, networking, troubleshooting
Docker (containerization)
Kubernetes: deployment, management, and troubleshooting of clusters and applications
CI/CD pipelines: able to own and improve delivery workflows
Cloud platform experience (e.g., AWS, GCP, Azure), with AWS preferred
Infrastructure as code (Terraform, Ansible, or similar): able to write and maintain IaC
Networking fundamentals (TCP/IP, DNS, Load Balancing, Firewalls): sufficient to diagnose and resolve production issues independently
Monitoring and alerting tools (Prometheus, Grafana, ELK, Datadog, New Relic): able to design and implement coverage
