Artisight transforms hospital operations with its Smart Hospital Platform, helping health systems reduce costs and improve efficiency. The Senior Site Reliability Engineer will architect healthcare technology systems, ensuring reliability and building resilient infrastructure that enhances patient care.
Responsibilities:
- Serve as the go-to expert for complex L2 support issues, diving deep into our stack not just to fix problems but to eliminate their root causes
- Engineer automation solutions that turn repetitive operational tasks into seamless, intelligent workflows, because manual work should be the exception, not the rule
- Design and implement next-generation observability platforms that provide crystal-clear insights into system health before problems become incidents
- Partner with development teams to bake reliability directly into new features and architectural decisions from day one
- Lead incident response with surgical precision, conducting thorough post-mortems that transform failures into learning opportunities
- Mentor emerging talent across engineering teams, spreading the SRE mindset and elevating our collective technical capabilities
- Hunt down performance bottlenecks across our infrastructure and applications, optimizing for speed and efficiency at scale
Requirements:
- Expert-level Python (scripting, automation, tooling)
- Linux proficiency (Ubuntu preferred): system administration, networking, and troubleshooting
- Docker (containerization)
- Kubernetes: deployment, management, and troubleshooting of clusters and applications
- CI/CD pipelines: able to own and improve delivery workflows
- Infrastructure as code (Terraform, Ansible, or similar): able to write and maintain IaC
- Networking fundamentals (TCP/IP, DNS, load balancing, firewalls): sufficient to diagnose and resolve production issues independently
- Monitoring and alerting tools (Prometheus, Grafana, ELK, Datadog, New Relic): able to design and implement coverage