Finite State is a fast-growing Series B company focused on securing the connected world by providing transparency for connected devices and supply chains. The company is seeking a Senior Site Reliability Engineer (SRE) to define and drive observability and reliability strategies within an AI-first development organization, ensuring operational excellence and infrastructure automation.

Responsibilities:

Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity
Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads
Define and implement a comprehensive observability framework across applications and infrastructure
Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives
Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms
Drive best practices in error budgeting, alert design, and production health monitoring
Define and evolve incident management processes, including on-call structures and escalation models, postmortems and blameless retrospectives, and runbooks and operational playbooks
Improve system reliability, performance, scalability, and cost efficiency
Establish operational KPIs and reliability dashboards for engineering and leadership visibility
Lead reliability reviews for new architecture and product initiatives
Architect and implement scalable cloud infrastructure primarily within AWS
Work closely with modern application platforms such as Vercel and Supabase
Implement and improve Infrastructure-as-Code practices
Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation
Ensure production-grade security, compliance, and resilience standards
Champion the use of AI tools to accelerate infrastructure provisioning, improve operational workflows, enhance observability signal quality, and automate incident response and remediation
Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability
Serve as a senior technical authority for reliability and infrastructure decisions
Mentor engineers on production best practices
Influence architectural decisions to improve system resilience and maintainability
Drive a culture of reliability, accountability, and continuous improvement

Requirements:

10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering
Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale
Deep experience building and managing on-call rotations and incident management processes
Strong background in distributed systems and cloud-native architectures
Hands-on experience with Honeycomb
Hands-on experience with Grafana
Hands-on experience with AWS
Hands-on experience with Vercel
Hands-on experience with Supabase
Strong experience with observability instrumentation and telemetry design
Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar)
Experience designing resilient CI/CD pipelines
Deep understanding of high-availability, scalability, and performance engineering principles
Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows
Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations
Strong interest in building AI-native operational practices
Ability to operate as both strategic architect and hands-on implementer
Strong written and verbal communication skills
Experience influencing cross-functional teams
Comfort working in fast-paced, high-growth environments
Experience supporting AI/ML workloads in production
Experience building internal developer platforms (IDP)
Experience with cost observability and FinOps practices
Experience scaling observability in high-growth SaaS environments
