Checkfront is hiring a Staff DevOps Engineer to work on its new product, Manifest, in a fast-moving environment. This role involves owning critical infrastructure, improving developer experience, and collaborating with product engineers and DevOps leadership.
Responsibilities:
- Work on a team with two other platform engineers
- Own and evolve the infrastructure that supports Manifest, including AWS environments, networking, compute, data services, observability, CI/CD, and operational tooling
- Work with Pulumi and TypeScript to define, maintain, and improve infrastructure as code across the platform
- Support and improve our containerized application platform, including deployment pipelines, rollback mechanisms, and runtime configuration
- Help operate and harden our data infrastructure, including connection pooling, backups, disaster recovery, replication, and safe schema-change practices
- Partner with engineers to improve the reliability and safety of releases, including database migrations, deployment workflows, environment management, and production readiness checks
- Improve CI/CD workflows so that builds, tests, infrastructure changes, and deployments are fast, reliable, and easy for engineers to understand
- Lead observability and incident readiness work, including alerting, dashboards, SLOs, runbooks, incident response practices, and post-incident follow-up
- Help ensure the platform is secure, cost-conscious, and maintainable as the product scales
- Mentor engineers on infrastructure, operations, reliability, and production ownership
Requirements:
- Deep production experience with AWS, especially services such as ECS/Fargate, RDS/Aurora PostgreSQL, VPC networking, load balancing, IAM, KMS, Secrets Manager, CloudFront, WAF, and related managed services
- Experience designing and operating systems that serve a global user base, including multi-region availability and disaster recovery procedures
- A habit of treating reliability, scalability, performance, and observability as first-class design constraints, building them into designs from the start rather than bolting them on later
- Strong infrastructure-as-code experience. Pulumi with TypeScript is ideal, but deep experience with Terraform or another mature IaC approach is also valuable
- Strong operational knowledge of PostgreSQL, including performance investigation, connection pooling, backups, replication, locking, migrations, and safe schema-change patterns
- Experience designing and maintaining CI/CD systems, ideally with GitHub Actions, OIDC-based cloud authentication, container builds, environment promotion, required checks, and deployment gates
- Experience supporting containerized production workloads and improving deployment safety, rollback strategies, and runtime reliability
- Strong observability and incident response experience, including metrics, logs, traces, alerting, dashboards, runbooks, and post-incident learning
- The ability to work effectively in ambiguity, make pragmatic tradeoffs, and communicate clearly with both infrastructure specialists and product engineers
- A track record of raising the engineering bar through reusable patterns, documentation, automation, mentoring, and thoughtful technical leadership