Finite State is a fast-growing Series B company focused on securing the connected world by providing transparency into connected devices and their supply chains. The company is seeking a Senior Site Reliability Engineer (SRE) to define and drive observability and reliability strategy within an AI-first development organization, ensuring operational excellence and infrastructure automation.
Responsibilities:
- Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity
- Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads
- Define and implement a comprehensive observability framework across applications and infrastructure
- Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives
- Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms
- Drive best practices in error budgeting, alert design, and production health monitoring
- Define and evolve incident management processes, including:
  - On-call structures and escalation models
  - Postmortems and blameless retrospectives
  - Runbooks and operational playbooks
- Improve system reliability, performance, scalability, and cost efficiency
- Establish operational KPIs and reliability dashboards for engineering and leadership visibility
- Lead reliability reviews for new architecture and product initiatives
- Architect and implement scalable cloud infrastructure primarily within AWS
- Work closely with modern application platforms such as Vercel and Supabase
- Implement and improve Infrastructure-as-Code practices
- Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation
- Ensure production-grade security, compliance, and resilience standards
- Champion the use of AI tools to:
  - Accelerate infrastructure provisioning
  - Improve operational workflows
  - Enhance observability signal quality
  - Automate incident response and remediation
- Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability
- Serve as a senior technical authority for reliability and infrastructure decisions
- Mentor engineers on production best practices
- Influence architectural decisions to improve system resilience and maintainability
- Drive a culture of reliability, accountability, and continuous improvement
Requirements:
- 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering
- Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale
- Deep experience building and managing on-call rotations and incident management processes
- Strong background in distributed systems and cloud-native architectures
- Hands-on experience with Honeycomb
- Hands-on experience with Grafana
- Hands-on experience with AWS
- Hands-on experience with Vercel
- Hands-on experience with Supabase
- Strong experience with observability instrumentation and telemetry design
- Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar)
- Experience designing resilient CI/CD pipelines
- Deep understanding of high-availability, scalability, and performance engineering principles
- Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows
- Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations
- Strong interest in building AI-native operational practices
- Ability to operate as both strategic architect and hands-on implementer
- Strong written and verbal communication skills
- Experience influencing cross-functional teams
- Comfort working in fast-paced, high-growth environments
- Experience supporting AI/ML workloads in production
- Experience building internal developer platforms (IDPs)
- Experience with cost observability and FinOps practices
- Experience scaling observability in high-growth SaaS environments