New Health Partners is seeking a Mid/Senior DevOps & Backend Engineer to bridge the gap between platform infrastructure and application development. In this role, you will design and operate cloud-native infrastructure for its InsurTech product suite while also contributing to backend services using TypeScript and Nest.js.
Responsibilities:
- Design, build, and maintain production-grade CI/CD pipelines (GitHub Actions, GitLab CI) with automated testing, security scanning, and progressive deployment strategies (blue-green, canary, feature flags)
- Manage and optimize AWS infrastructure including EKS, EC2, RDS, ECR, S3, Lambda, CloudFront, Route 53, and IAM—with a focus on cost optimization, high availability, and disaster recovery
- Build and maintain Kubernetes clusters (EKS) with Helm charts, custom operators, autoscaling policies, and multi-environment management (dev, staging, production)
- Automate infrastructure provisioning and configuration using Terraform (primary), Ansible, and CloudFormation with GitOps workflows and drift detection
- Implement comprehensive observability using Prometheus, Grafana, Datadog, ELK/OpenSearch, and distributed tracing (Jaeger/OpenTelemetry) for full-stack visibility
- Design and maintain networking architecture including VPCs, security groups, load balancers, service meshes (Istio/Linkerd), and DNS management
- Provision and manage GPU-accelerated compute environments (AWS P4/P5 instances, Inferentia, SageMaker) for LLM training, fine-tuning, and inference workloads
- Build containerized model-serving infrastructure supporting vLLM, TGI (Text Generation Inference), NVIDIA Triton, and custom inference endpoints with autoscaling based on request load and latency targets
- Design and operate data pipelines and storage architectures (S3, EFS, FSx for Lustre) optimized for large-scale model training datasets and artifact management
- Implement CI/CD automation specifically for ML/AI workflows—model versioning, automated evaluation gates, staged rollouts of model updates, and A/B inference routing
- Collaborate with the AI team to optimize GPU utilization, manage spot instance strategies, and implement cost-aware scheduling for training jobs
- Set up monitoring dashboards for model inference latency, throughput, token usage, GPU utilization, and cost tracking
- Contribute to and extend backend services built with Nest.js and TypeScript, focusing on scalability, reliability, and clean architecture
- Develop an internal TypeScript framework
- Build and maintain scalable microservices and RESTful/GraphQL APIs that integrate with AI inference endpoints and the LLM Composer platform
- Design event-driven architectures using Kafka, SQS/SNS, and WebSockets for real-time data processing and AI-powered features
- Ensure all deployments are production-ready and horizontally scalable, and that they follow 12-factor app principles with proper health checks, graceful shutdowns, and circuit breakers
- Collaborate with backend and AI teams on system architecture, API contracts, database schema design, and reliability improvements
- Implement database management best practices including migration strategies, read replicas, connection pooling, and query optimization for PostgreSQL and Redis
Requirements:
- 4–7+ years of professional experience in DevOps, Cloud Engineering, or Platform Engineering, with meaningful backend development experience
- Hands-on Kubernetes experience (EKS strongly preferred), including cluster administration, Helm chart development, autoscaling, and troubleshooting
- Strong proficiency with TypeScript and Nest.js (or comparable Node.js backend frameworks such as Express or Fastify)
- Deep AWS expertise across compute, storage, networking, IAM, and managed services—with experience optimizing for cost and performance
- Strong Infrastructure-as-Code skills with Terraform; experience with modular, reusable configurations and state management
- Solid understanding of microservices architecture, distributed systems patterns, and container orchestration
- Experience with Docker, container registries, and container security best practices
- Proficiency with CI/CD pipeline design including automated testing, security scanning, and deployment strategies
- Familiarity with GitOps workflows and version-controlled infrastructure management
- Strong Linux systems administration and shell scripting skills
- Experience provisioning and managing GPU workloads for ML/AI model training and inference in cloud environments
- Familiarity with ML model serving frameworks (vLLM, TGI, Triton, BentoML, SageMaker Endpoints)
- Experience with Kafka, event-driven architectures, and real-time streaming systems
- Familiarity with service mesh technologies (Istio, Linkerd) and API gateway management
- Experience with HIPAA, SOC 2, or other healthcare/financial compliance frameworks in cloud environments
- Knowledge of database technologies beyond core PostgreSQL, such as vector databases (Pinecone, pgvector), graph databases, or time-series databases
- Experience with chaos engineering, load testing, and reliability engineering practices (SRE)
- AWS certifications (Solutions Architect, DevOps Engineer, or equivalent)