Medeloop is seeking a Senior DevOps & Site Reliability Engineer to own the reliability, scalability, performance, and operational excellence of its platform. The role blends deep DevOps engineering with SRE disciplines to keep clinical research products available, performant, and secure for healthcare organizations.
Responsibilities:
- Design, implement, and manage scalable, secure, and highly available cloud infrastructure on AWS, defined as infrastructure as code (IaC) using AWS CDK, CloudFormation, or Terraform so that all environments are version-controlled and reproducible
- Architect multi-region and disaster recovery strategies that meet healthcare uptime requirements
- Manage containerized workloads using Docker and Kubernetes, optimizing for cost, performance, and resilience
- Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across all production services
- Build and maintain observability stacks (Datadog, AWS CloudWatch, Sentry) covering metrics, logs, traces, and alerting
- Lead incident response: triage, mitigate, and drive blameless post-incident reviews with actionable follow-ups
- Conduct capacity planning and performance engineering to ensure the platform scales ahead of demand
- Champion error budgets and use them to balance feature velocity with system stability
- Identify, assess, and mitigate operational risks before they become incidents, collaborating with engineering and product teams to evaluate impact and likelihood
- Participate in and help structure an on-call rotation, ensuring clear escalation paths and fair distribution of after-hours coverage
- Build self-service tooling and runbooks that reduce toil and empower development teams to ship independently
- Design and maintain CI/CD pipelines (GitHub Actions) that enable fast, safe, and repeatable deployments
- Automate security scanning (SAST, DAST) within pipelines and collaborate with engineering to remediate findings
- Implement progressive delivery strategies such as canary deployments, blue-green releases, and feature flags
- Use scripting languages (Python, Bash) for automation, troubleshooting, and building reliability tooling
- Track and drive down operational toil, targeting less than 50% of team time spent on repetitive manual work
- Evaluate and manage change risk for production deployments, maintaining change review processes that balance speed with stability
- Ensure infrastructure meets healthcare compliance standards (HIPAA, SOC 2) through policy-as-code, encryption, and access controls
- Manage network security (VPCs, subnets, security groups, WAFs) and identity/authentication systems (AWS Cognito, Auth0, OAuth2, SSO)
- Conduct regular security reviews, vulnerability assessments, and patching across the infrastructure estate
- Partner closely with product and engineering teams to embed reliability thinking into the software development lifecycle
- Develop and maintain comprehensive documentation for infrastructure, runbooks, and operational playbooks
- Mentor junior engineers on DevOps and SRE best practices, fostering a culture of ownership and continuous improvement
- Stay current with advancements in cloud technologies, DevOps tooling, and SRE methodologies
- Own and evolve internal developer platform tooling, including deployment workflows (GitOps/Flux), bug tracking integrations, and developer self-service portals
Requirements:
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
- 7+ years of combined experience in DevOps and/or Site Reliability Engineering roles, with at least 2 years in a senior capacity
- Deep proficiency with AWS services
- Deep experience with observability and monitoring platforms such as Datadog, AWS CloudWatch, and Sentry
- Strong experience building and maintaining CI/CD pipelines with GitHub Actions or equivalent tools
- Expertise in infrastructure as code using AWS CDK, CloudFormation, or Terraform
- Hands-on experience with containerization (Docker) and orchestration (Kubernetes)
- Proven track record of defining and operating against SLOs/SLIs and managing incident response processes
- Solid understanding of networking (VPCs, subnets, load balancing, DNS), security, and compliance best practices
- Experience with authentication and authorization systems including AWS Cognito, Auth0, OAuth2, and SSO
- Proactive, self-directed mindset with a bias toward action and taking initiative
- Excellent problem-solving skills and the ability to work independently as well as collaboratively across teams
- Strong communication skills, with the ability to explain complex infrastructure decisions clearly to technical and non-technical stakeholders
- Passion for unsolved challenges in healthcare AI, with the ability to thrive in a fast-paced, multidisciplinary environment and wear multiple hats
Preferred Qualifications:
- Multi-cloud experience (AWS, Azure, GCP)
- Familiarity with healthcare data standards, compliance, and protocols such as HIPAA, HL7 FHIR, OMOP, and i2b2
- Experience with chaos engineering practices and tools (e.g., AWS Fault Injection Simulator, Gremlin)
- Prior experience in a healthcare or life sciences company operating under strict regulatory requirements
- Contributions to open-source infrastructure or SRE tooling
- Relevant certifications such as AWS Solutions Architect, Certified Kubernetes Administrator (CKA), or Google SRE certification