Pantomath is building an autopilot for the data-driven enterprise, aiming to automate the lifecycle of data reliability. The Sr. Site Reliability Engineer will be a senior technical leader responsible for the availability, security, performance, and scalability of the platform, while driving infrastructure strategy and ensuring reliability excellence across the organization.
Responsibilities:
- Design, build, and maintain Pantomath's cloud infrastructure on AWS (EC2, EKS, IAM, ALB, RDS, S3) using Infrastructure as Code principles (Terraform, CDK)
- Architect and evolve CI/CD pipelines (GitHub Actions, Nx) that enable development teams to ship with speed, confidence, and consistency
- Lead the incident response lifecycle — own runbooks, drive resolution, and conduct blameless postmortems that harden the platform for the future
- Manage business-as-usual (BAU) operations, including backups, credential rotation, log retention, and system administration, with operational discipline
- Apply zero trust and least privilege design patterns to authorization, authentication, networking, and runtime threat detection across the platform
- Partner with leadership to maintain SOC 2-compliant infrastructure practices and proactively close security gaps before they become incidents
- Implement and manage robust observability tooling (Datadog, CloudWatch, Prometheus) — define standards for logging, metrics, and alerting that give every team real-time platform visibility
- Support agent observability for connector services central to Pantomath's autonomous remediation engine
- Establish cost dashboards, conduct bi-weekly reviews, and implement right-sizing, idle shutdown, and shared infrastructure patterns that meaningfully reduce cloud spend
- Lead migration to shared ALB patterns and optimize EKS autoscaling to support rapid customer and product growth
- Contribute to multi-region readiness strategy and proactively address AWS service limits and scalability bottlenecks before they impact customers
- Reduce friction for developers — automate manual provisioning, clean up IaC repositories, and streamline dev and staging environments so engineers can move fast
- Champion DevOps and SRE best practices within an Agile/Scrum framework across multiple engineering pods
- Drive the infrastructure roadmap and platform strategy in close partnership with the VP of Engineering and company leadership
- Contribute to system architecture discussions and mentor engineers across the organization on reliability and operational excellence
Requirements:
- Bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent practical experience
- 5+ years of experience in Site Reliability, Platform Engineering, DevOps, or Cloud Engineering — ideally in a high-growth startup environment
- Demonstrated track record of owning platform initiatives end-to-end, from design through production operation
- Proven experience operating within an Agile/Scrum development methodology
- Deep AWS expertise across core services (EC2, EKS, IAM, ALB, RDS, S3) and strong hands-on experience with Terraform or comparable IaC tools
- Solid CI/CD knowledge, preferably with GitHub Actions, and the ability to build pipelines that accelerate engineering without sacrificing safety
- Proficiency with observability tooling (Datadog, Prometheus, CloudWatch) and the judgment to define meaningful alerting standards across a distributed platform
- Strong command of security best practices — least privilege, secret management, zero trust networking, and runtime threat detection
- Proficiency in at least one scripting language (Python, Bash) for automation, tooling, and infrastructure management
- Proficiency with AI coding assistants and a commitment to evolving SDLC workflows to maximize the impact of AI-driven development
- Excellent problem-solving, communication, and cross-functional collaboration skills
- Experience designing and operating multi-region AWS architectures at scale
- Prior work in a SOC 2-compliant environment with direct involvement in audit readiness
- Track record of measurably reducing cloud spend through architectural and operational improvements
- Familiarity with container networking, ALB/NGINX routing, and EKS tuning
- Experience supporting data infrastructure or AI/ML workloads in production environments