Koda Health is looking for a Senior Infrastructure & Security Engineer to own the reliability, security, and operational health of its production systems. The role involves managing AWS infrastructure, maintaining security compliance, and contributing to the codebase.
Responsibilities:
- Own the operational health of production across two AWS regions
- Investigate production issues, lead root-cause analysis, and drive resolution
- Build and maintain dashboards that give real-time visibility into application health, queue depths, API latency, and error rates
- Monitor SQS/SNS queue health, dead-letter queues, and event processing pipelines
- Expand observability beyond CloudWatch - evaluate and implement distributed tracing, APM, and log aggregation
- Oversee weekly deployments to production
- Own cost monitoring and alerting (AWS Budgets alerts, Cost Explorer)
- Improve automated uptime and SLA reporting
- Own and evolve all AWS infrastructure defined in CDK
- Lead the migration to capturing 100% of cloud infrastructure in CDK
- Manage and improve services: Lambda, ECS Fargate, Elastic Beanstalk, S3, CloudFront, SNS, SQS, EventBridge, WAF, Cognito
- Support multi-region uptime, disaster recovery planning, and backup/restore practices
- Improve cross-region replication and automated failover
- Own deployment pipelines, release processes, and database migration safety
- Support and evolve data pipelines used for analytics and product features
- Set standards for how we ship, deploy, and operate software at scale
- Maintain and harden AWS infrastructure with a strong security mindset
- Own vulnerability remediation and SLA timelines
- Help respond to security questionnaires and vendor assessments
- Own and improve WAF rules, security groups, IAM policies, and network configuration
- Own Security Hub, AWS Config, VPC Flow Logs, and CloudTrail
- Support GuardDuty malware scanning and S3 upload security
- Ensure SOC 2 and HIPAA compliance across infrastructure
- Manage secrets, key rotation, and access controls
- Conduct periodic security reviews of infrastructure and application configuration
- Triage and fix production errors surfaced by Sentry
- Make small TypeScript PRs to backend services
- Debug complex production issues that span infrastructure and application code
- Participate in architecture discussions, especially around infrastructure and deployment concerns
Requirements:
- 6+ years building and operating production systems on AWS
- Strong experience with AWS CDK (we use CDK in TypeScript)
- Deep knowledge of core AWS services: Lambda, ECS, S3, CloudWatch, SNS, SQS, IAM, VPC, WAF
- Experience setting up and managing monitoring, alerting, and incident management
- Experience with security hardening and compliance in regulated environments (HIPAA, SOC 2, or similar)
- Working knowledge of TypeScript or Node.js - enough to read the codebase, make PRs, and debug production issues
- Experience with CI/CD pipelines (CodePipeline, GitHub Actions, or similar)
- Comfortable owning production systems end-to-end in a small team where you're the expert
- Strong English fluency - written and verbal communication (security questionnaire responses, etc.)
- US-based, able to work CST/EST hours (contractual requirement)
- Healthcare industry experience (FHIR, HL7v2, Epic/Cerner integrations)
- Experience with multi-region AWS architectures and disaster recovery
- Experience with MongoDB operations and performance
- Experience with cost optimization in AWS
- Familiarity with AI-assisted development tools (e.g., Claude Code)