Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
Implement observability best practices, including tracing-first approaches, meaningful metrics, and monitoring that actually helps during incidents.
Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
Build automation that eliminates manual intervention across CI/CD, deployments, configuration management, and recovery—because your time is better spent on strategic problems.
Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.
Partner with engineering and product teams during design reviews to ensure new features are production-ready and operationally scalable.
Optimize infrastructure costs through performance tuning, capacity planning, and smart use of cloud resources.
Mentor engineers on operational best practices and champion reliability thinking across the organization.
Document infrastructure architecture clearly and maintain the kind of runbooks that your future self will thank you for.
Requirements
4+ years of experience in SRE, DevOps, or infrastructure engineering roles, with a demonstrated track record of supporting SaaS platforms in production.
Expert-level knowledge of an infrastructure-as-code framework (Pulumi, Terraform, CDK)—you should be the kind of person who thinks "if it's not in code, it doesn't exist."
Strong working knowledge of AWS (or equivalent cloud platforms), including designing for availability, scalability, and security.
Proficiency in TypeScript or Python for infrastructure automation and tooling.
Experience with containerization and orchestration (ECS Fargate, Kubernetes, or similar).
Deep familiarity with observability tools and practices (OpenTelemetry, CloudWatch, Honeycomb)—bonus points if you embrace a tracing-first philosophy.
Solid understanding of networking, load balancing, and distributed systems concepts.
Experience with CI/CD tooling (GitHub Actions, CodeBuild, or equivalent).
The ability to communicate complex operational issues clearly to both technical and non-technical stakeholders.
Calm effectiveness during high-pressure incidents and the judgment to balance competing priorities like performance, cost, and reliability.
A collaborative spirit and the ability to build strong relationships with engineering, product, and operations teams.
Prior experience working closely with product engineering teams is a strong plus—this role thrives on cross-disciplinary understanding.
A commitment to continuous learning and improving team practices, systems, and culture.
Tech Stack
AWS
Cloud
Distributed Systems
Kubernetes
Python
Terraform
TypeScript
Benefits
We will:
Give you ownership over infrastructure that powers a globally used platform, with clear visibility into how your work drives collaboration and productivity.
Provide meaningful opportunities to learn and grow, whether that's diving deeper into distributed systems, exploring new observability paradigms, or mastering the latest cloud-native technologies.
Surround you with a team that values blameless postmortems, continuous improvement, and the kind of operational culture where everyone learns from every incident.
Share the "why" behind architectural decisions and give you a voice in shaping Fixify's reliability engineering principles as we scale.
Connect you directly with product engineers and users, so you see firsthand how reliable infrastructure translates into delighted customers.
Let you work across a hybrid container and serverless infrastructure environment, using what works best and leaning into each service’s strengths.