Design, build, and maintain Azure-based infrastructure, with a primary focus on Azure Kubernetes Service (AKS), optimizing for reliability, scalability, and developer experience.
Architect and operate infrastructure to support continuous availability — including zero-downtime deployments, automated rollouts, and the ability to scale capacity up and down in response to predictable demand peaks and quiet periods throughout the year.
Own system reliability and maintenance practices, including patching, upgrades, and configuration management across environments, ensuring infrastructure remains healthy, current, and audit-ready.
Develop and maintain disaster recovery and business continuity plans — including documented runbooks, tested recovery procedures, rollback strategies, and data recovery protocols that can be executed confidently when needed.
Develop and document reusable tools, networking patterns, and infrastructure templates for engineering teams to follow.
Collaborate cross-functionally with engineering teams to understand their infrastructure needs and to prepare them for upcoming infrastructure changes.
Own and improve CI/CD pipelines using GitHub Actions, ensuring fast, reliable, and secure delivery workflows.
Manage infrastructure-as-code using Terraform, enabling repeatable and auditable provisioning across environments.
Implement and maintain observability and monitoring solutions, including Grafana dashboards and alerting, to provide teams with clear visibility into system health.
Manage identity and access using Microsoft Entra ID, applying least-privilege principles across services and teams.
Approach all infrastructure work with a security-first mindset — proactively identifying risks, enforcing compliance patterns, and communicating deviations from standard operating procedures.
Communicate clearly with stakeholders and adjacent teams on infrastructure changes, timelines, and dependencies.
Contribute to the team's knowledge base by creating runbooks, architecture documentation, and onboarding guides.
Requirements
4+ years of hands-on experience with Azure infrastructure in a production environment.
Deep experience with Azure Kubernetes Service (AKS): cluster management, networking, scaling, GitOps, and day-2 operations.
Strong understanding of cloud networking, including VNets, NSGs, private endpoints, DNS, and ingress/egress patterns.
Experience with infrastructure-as-code — Terraform preferred.
Proficiency with CI/CD tooling, particularly GitHub Actions.
Comfort working in a small, remote team with a high degree of autonomy and ownership.
Strong written and verbal communication skills — able to work cross-functionally, explain technical decisions clearly, and keep stakeholders informed.
Security-conscious approach to infrastructure design and operations.
Eastern or Central time zone required for team collaboration.