Design, automate, deploy, and operate highly reliable cloud systems supporting mission-critical workloads for U.S. Government customers.
Ensure systems are observable, fault-tolerant, and require minimal manual intervention.
Drive consistency, reliability, and performance across all environments.
Help define and implement best practices for availability, latency, incident response, and service-level objectives (SLOs).
Participate in incident response and a 24/7 on-call rotation.
Collaborate closely with frontend, backend, and platform engineers to ensure systems meet performance, reliability, and mission assurance requirements.
Requirements
Bachelor’s degree in Computer Science or a related field
3+ years of professional experience as an SRE or as a DevOps, reliability, infrastructure, or platform engineer
Active U.S. Security Clearance (Secret or higher required; TS/SCI preferred); U.S. Citizenship required
Experience working toward ATO/authorization in federal, DoD, or IC environments preferred
Experience supporting deployments in GovCloud, C2S/C2E, or IL-enclave environments highly desirable
Strong experience with Kubernetes and containerized workloads in production environments
Hands-on experience operating clusters in AWS EKS, Rancher, or similar platforms
Deep experience with CI/CD systems and deployment automation (GitLab preferred)
Proficiency in Python and Infrastructure-as-Code tools (Terraform or similar)
Experience with observability platforms (Grafana LGTM stack, Datadog, or equivalent)
Strong understanding of distributed systems, APIs, databases, caching, and event-driven architectures
Solid networking fundamentals (VPCs, VPNs, load balancers, TLS, service connectivity)
Experience with Linux/Unix systems
Familiarity with cloud security best practices, enclave boundaries, and secure system design
Experience with identity and access management (AWS IAM, Auth0, Keycloak, ICAM patterns)
Strong Git fundamentals and experience supporting deployments across multiple classification levels
Tech Stack
AWS
Cloud
Distributed Systems
Grafana
Kubernetes
Linux
Python
Terraform
TypeScript
Unix
Benefits
We take work-life balance very seriously. We provide unlimited PTO, require employees to take at least 15 days off, and follow most U.S. federal government holidays.
Mental health is just as important as physical health, so we provide quarterly health & wellness benefits.
Comprehensive health insurance for you and your family, with 100% coverage for employees.
We encourage employees to save for retirement and provide 4% 401(k) matching.
Annually we have a 4-day company offsite. Previous locations include San Francisco, Nashville, Denver, Santa Fe, New Orleans, San Diego, Bozeman, and New York City.
Our culture and company are evolving. You will be key in creating the next major or minor version!