Kontakt.io is building a platform that automates and orchestrates clinical workflows in hospitals, enhancing efficiency through AI and real-time data. They are seeking a Lead Software Engineer - SRE to drive the reliability, scalability, and performance of their AWS-based platform, while mentoring engineers and shaping technical strategy.
Responsibilities:
- Lead the design and implementation of scalable, fault-tolerant, and self-healing infrastructure and services across AWS and Kubernetes
- Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities and platform needs
- Define and drive adoption of SLIs, SLOs, and SLAs to ensure consistent performance and high reliability across the platform
- Own and evolve observability strategies using Prometheus, OpenTelemetry, Grafana, and related tooling
- Design and maintain infrastructure as code (Terraform) and drive GitOps best practices
- Oversee major incident response and on-call practices, including incident reviews and long-term remediation planning
- Mentor and support the growth of SRE and platform engineers, fostering a culture of engineering rigor and operational excellence
- Contribute to the long-term reliability roadmap and architecture of high-throughput, real-time systems in healthcare operations
- Drive process improvements in CI/CD, service ownership, chaos engineering, disaster recovery, and secure deployment
Requirements:
- 5+ years of experience in Site Reliability Engineering, Cloud Infrastructure, or Platform Engineering
- 5+ years of software engineering experience building production-grade systems (Java, Python, Go, or similar)
- Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare environments
- Deep expertise in cloud platforms (especially AWS), Kubernetes, and distributed system architecture
- Hands-on experience with monitoring, logging, and observability tools (Prometheus, OpenTelemetry, Datadog, etc.)
- Extensive knowledge of CI/CD automation, GitOps workflows, and infrastructure as code (Terraform, Helm, Argo CD)
- A track record of leading major incident response and running postmortems with a blameless, learning-focused approach
- Strong understanding of networking, access control, and security within regulated environments (HIPAA, SOC 2)
- A leadership mindset: the ability to drive cross-functional alignment, lead initiatives, and mentor a high-performing SRE team