Artisight transforms hospital operations with its Smart Hospital Platform, helping health systems reduce costs, improve efficiency, and enhance patient care. The company is seeking a Senior Site Reliability Engineer to architect reliable, resilient healthcare technology systems that directly affect patient care. The role involves creating the SRE team and optimizing infrastructure for performance and efficiency.
Responsibilities:
- Serve as the go-to expert for complex L2 support issues, diving deep into our stack to not just fix problems but eliminate their root causes forever
- Engineer automation solutions that turn repetitive operational tasks into seamless, intelligent workflows, because manual work should be the exception, not the rule
- Design and implement next-generation observability platforms that provide crystal-clear insights into system health before problems become incidents
- Partner with development teams to bake scalability, reliability, and security directly into new features and architectural decisions from day one
- Lead incident response with surgical precision, conducting thorough post-mortems that transform failures into learning opportunities
- Mentor emerging talent across engineering teams, spreading the SRE mindset and elevating our collective technical capabilities
- Hunt down performance bottlenecks across our infrastructure and applications, optimizing for speed and efficiency at scale
Requirements:
- Expert-level Python (scripting, automation, tooling)
- Linux proficiency (Ubuntu preferred): system administration, networking, troubleshooting
- Docker (containerization)
- Kubernetes: deployment, management, and troubleshooting of clusters and applications
- CI/CD pipelines: able to own and improve delivery workflows
- Cloud platform experience (e.g., AWS, GCP, Azure), with AWS preferred
- Infrastructure as code (Terraform, Ansible, or similar): able to write and maintain IaC
- Networking fundamentals (TCP/IP, DNS, load balancing, firewalls): sufficient to diagnose and resolve production issues independently
- Monitoring and alerting tools (Prometheus, Grafana, ELK, Datadog, New Relic): able to design and implement coverage