Design, deploy, and manage scalable and highly available cloud infrastructure on AWS, with deep expertise in core services (EC2, EKS, S3, RDS, IAM, VPC, and beyond).
Develop and maintain disaster recovery plans leveraging AWS capabilities for backup and replication to ensure business continuity.
Collaborate with engineering and security teams to improve infrastructure health, security, and long-term scalability.
Design reusable Terraform/OpenTofu modules following DRY principles and organizational standards; implement module versioning and lifecycle strategies.
Direct the migration of manual infrastructure to code; establish patterns and best practices for IaC adoption across the team.
Implement IaC testing strategies, including validation, linting, and integration testing, using tools such as terraform-compliance or Checkov.
Architect and maintain complex Bitbucket Pipelines configurations for multi-environment IaC deployments; implement pipeline security best practices.
Implement AIOps practices, leveraging AI tools to enhance monitoring, incident response, and predictive alerting.
Use AI-assisted development and operations tools (e.g., Cursor, Claude) to accelerate troubleshooting, code review, and documentation generation.
Evaluate and implement AI-powered automation to reduce operational toil, improve repeatability, and scale platform capabilities.
Define and implement SLOs for services; guide and/or participate in incident response and conduct blameless postmortems.
Implement chaos engineering practices to proactively identify system weaknesses before they impact production.
Build and maintain comprehensive monitoring solutions using tools such as CloudWatch and Datadog to track performance and drive optimization.
Develop automation scripts and tools in Python, Bash, or similar languages to streamline operations and eliminate manual toil.
Build self-service capabilities for development teams to reduce cognitive load and enable developer autonomy across the organization.
Guide the solution architecture and end-to-end implementation of DistroKid’s first Internal Developer Portal (IDP).
Define the IDP roadmap and success criteria in partnership with engineering leadership; establish golden paths, service catalogs, and self-service workflows that reduce deployment friction and accelerate developer productivity.
Drive adoption of the IDP across engineering teams; gather feedback, iterate on the platform, and measure impact through developer experience metrics and reduced time-to-deploy.
Guide cost optimization initiatives; implement rightsizing recommendations, reserved-capacity strategies, and tagging standards for cost allocation.
Monitor and optimize AWS resource usage; select appropriate services and configurations to meet performance requirements cost-effectively.
Direct planning, decision-making, and execution for infrastructure projects; own workstreams end-to-end.
Partner cross-functionally with engineering, security, and product teams; communicate impact in terms of company strategy and OKRs.
Provide technical mentorship to junior and mid-level engineers; invest in team growth and foster a culture of continuous learning.
Maintain and contribute to infrastructure documentation, runbooks, and architectural decision records to ensure knowledge sharing and operational consistency.
Requirements
Bachelor’s degree in Computer Science, Information Technology, a related field, or equivalent practical experience.
5+ years of experience in systems operations, platform engineering, or DevOps with a focus on cloud infrastructure and containerized environments.
Proven production experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, API Gateway, EventBridge, etc.) and Kubernetes.
5+ years of hands-on experience with Infrastructure as Code tools, specifically Terraform and/or OpenTofu, including module design, state management, remote backends, and IaC testing.
Strong knowledge of Linux/Unix systems administration and shell scripting.
Proficiency in Python, Go, or similar programming languages.
Experience with CI/CD pipelines for infrastructure deployments (Bitbucket Pipelines, Jenkins, or similar).
Experience with monitoring and observability tools (Prometheus, Grafana, CloudWatch, or Datadog).
Demonstrated experience implementing or working with AIOps tools, practices, or AI-assisted operations in a professional context.
Experience using AI-assisted development tools (e.g., Cursor, Warp, Claude, or similar) to accelerate engineering work.