Vynca is dedicated to transforming care for individuals with complex needs. The company is seeking a Site Reliability Engineer to build and operate the infrastructure behind its healthcare technology platform, with a focus on the reliability, scalability, and security of its systems.
Responsibilities:
- Design, provision, and manage AWS infrastructure using Terraform as the source of truth
- Operate, maintain, and scale production workloads running on Kubernetes
- Package, deploy, and manage applications using Helm and infrastructure automation tools
- Build, operate, and improve distributed and event-driven systems, including event sourcing, partitioning, event ordering, replay, and failure recovery mechanisms
- Define, monitor, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to balance reliability and engineering velocity
- Develop automation for deployment, scaling, monitoring, incident response, and operational workflows to reduce manual effort and improve system resilience
- Own platform observability by implementing and maintaining metrics, logging, tracing, monitoring, and alerting solutions
- Lead incident response efforts, facilitate blameless postmortems, and drive long-term corrective actions that improve system reliability
- Partner with Product and Engineering teams on capacity planning, performance optimization, and resilient system design
- Implement and maintain security best practices to support HIPAA, SOC 2, and other compliance requirements
- Participate in an on-call rotation and provide operational support for production systems
Requirements:
- 3–5 years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, Cloud Infrastructure Engineering, or a similar infrastructure-focused role, preferably within healthcare, SaaS, or high-growth technology environments
- Bachelor's degree in Computer Science, Information Systems, Software Engineering, or a related technical field; equivalent professional experience will also be considered
- Strong hands-on experience operating production workloads within AWS environments
- Proven experience managing infrastructure as code using Terraform, including module development, state management, and deployment automation
- Experience operating and supporting production Kubernetes environments
- Hands-on experience deploying and managing applications using Helm
- Experience working with distributed systems, event-driven architectures, or event-sourcing platforms, including concepts such as partitioning, event ordering, replay, and fault tolerance
- Experience establishing and managing observability practices including monitoring, logging, tracing, alerting, and incident response
- Strong understanding of Linux systems administration, networking, cloud architecture, and distributed systems fundamentals
- Experience designing, implementing, and maintaining CI/CD pipelines and deployment automation
- Strong problem-solving skills with the ability to troubleshoot complex infrastructure and application issues
- Excellent written and verbal communication skills with the ability to collaborate effectively across technical and non-technical teams
- High level of ownership, accountability, and initiative with a proactive approach to reliability and operational excellence
- Ability and willingness to participate in an on-call rotation supporting production systems
- Strong programming or scripting experience with Python, Go, or similar languages
- Experience with observability tools such as Prometheus, Grafana, Datadog, CloudWatch, SigNoz, or OpenTelemetry
- Experience with GitOps tools such as Argo CD or Flux
- Experience managing databases such as PostgreSQL, MySQL, Redshift, or ClickHouse
- Experience implementing secrets management solutions such as AWS Secrets Manager or HashiCorp Vault
- Experience supporting healthcare technology platforms or other highly regulated environments
- Familiarity with data infrastructure technologies including Snowflake, Redshift, and ETL/ELT pipelines
- Experience with database performance tuning and optimization