Fieldguide is establishing a new state of trust for global commerce and capital markets by automating and streamlining the work of assurance and audit practitioners, specifically in cybersecurity, privacy, and financial audit. As a Senior Site Reliability Engineer, you will ensure the reliability, scalability, and observability of production systems while collaborating with engineering teams to improve system performance and drive operational excellence.

Responsibilities:

Design and operate highly scalable, fault-tolerant systems that support production workloads across a distributed cloud environment
Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to guide reliability decisions
Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance
Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning
Automate operational processes to reduce manual toil and improve system consistency and resilience
Partner with engineering teams to design systems with reliability and scalability built in from the start
Participate in and improve incident response, on-call practices, and post-incident reviews, focusing on root cause analysis and systemic improvements
Drive continuous improvement of system resilience, including disaster recovery and chaos testing
Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues
Advocate for a reliability-focused engineering culture, including blameless postmortems and operational excellence

Requirements:

5+ years of experience in site reliability engineering, infrastructure, or a related software engineering discipline
Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred
Hands-on experience building and managing observability platforms (e.g., Datadog, Prometheus, Grafana, CloudWatch)
Experience defining SLOs/SLIs and using them to drive engineering priorities
Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent
Deep understanding of system performance, reliability patterns, and distributed system failure modes
Experience supporting production systems through on-call rotations and incident response
Proficiency in at least one programming or scripting language used for automation and tooling
Strong communication and collaboration skills, with the ability to work effectively across engineering and product teams
Experience implementing distributed tracing systems, such as OpenTelemetry or similar frameworks
Experience with capacity planning and performance benchmarking at scale
Familiarity with database performance tuning and observability across high-traffic systems
Exposure to regulated or compliance-heavy engineering environments (e.g., SOC 2, FedRAMP, or equivalent frameworks)
Experience applying chaos engineering practices to proactively test and strengthen system resilience
