Role Overview

Own the Platform:
Design, build, and maintain Pantomath's cloud infrastructure on AWS (EC2, EKS, IAM, ALB, RDS, S3) using Infrastructure as Code (Terraform, CDK).
Architect and evolve CI/CD pipelines (GitHub Actions, Nx) that enable development teams to ship with speed, confidence, and consistency.
Lead the incident response lifecycle — own runbooks, drive resolution, and conduct blameless postmortems that harden the platform for the future.
Manage business-as-usual (BAU) operations, including backups, credential rotation, log retention, and system administration, with operational discipline.

Engineer for Reliability and Security:
Apply zero-trust and least-privilege design patterns to authorization, authentication, networking, and runtime threat detection across the platform.
Partner with leadership to maintain SOC 2-compliant infrastructure practices and proactively close security gaps before they become incidents.
Implement and manage robust observability tooling (Datadog, CloudWatch, Prometheus) — define standards for logging, metrics, and alerting that give every team real-time platform visibility.
Support agent observability for connector services central to Pantomath's autonomous remediation engine.

Drive Efficiency and Scale:
Establish cost dashboards, conduct bi-weekly reviews, and implement right-sizing, idle shutdown, and shared infrastructure patterns that meaningfully reduce cloud spend.
Lead migration to shared ALB patterns and optimize EKS autoscaling to support rapid customer and product growth.
Contribute to multi-region readiness strategy and proactively address AWS service limits and scalability bottlenecks before they impact customers.
Reduce friction for developers — automate manual provisioning, clean up IaC repositories, and streamline dev and staging environments so engineers can move fast.

Shape the Engineering Culture:
Champion DevOps and SRE best practices within an Agile/Scrum framework across multiple engineering pods.
Drive the infrastructure roadmap and platform strategy in close partnership with the VP of Engineering and company leadership.
Contribute to system architecture discussions and mentor engineers across the organization on reliability and operational excellence.

Requirements

Bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent practical experience.
5+ years of experience in Site Reliability, Platform Engineering, DevOps, or Cloud Engineering — ideally in a high-growth startup environment.
Demonstrated track record of owning platform initiatives end-to-end, from design through production operation.
Proven experience operating within an Agile/Scrum development methodology.
Deep AWS expertise across core services (EC2, EKS, IAM, ALB, RDS, S3) and strong hands-on experience with Terraform or comparable IaC tools.
Solid CI/CD knowledge, preferably with GitHub Actions, and the ability to build pipelines that accelerate engineering without sacrificing safety.
Proficiency with observability tooling (Datadog, Prometheus, CloudWatch) and the judgment to define meaningful alerting standards across a distributed platform.
Strong command of security best practices: least privilege, secret management, zero-trust networking, and runtime threat detection.
Proficiency in at least one scripting language (Python, Bash) for automation, tooling, and infrastructure management.
Proficient in leveraging AI coding assistants and committed to evolving SDLC workflows to maximize the impact of AI-driven development.
Excellent problem-solving, communication, and cross-functional collaboration skills.

Tech Stack

AWS
Cloud
EC2
Prometheus
Python
SDLC
Terraform

Benefits

Equal Opportunity Employer
Reasonable accommodations offered

Senior Site Reliability Engineer
