Fieldguide is establishing a new state of trust for global commerce and capital markets by automating and streamlining the work of assurance and audit practitioners, specifically in cybersecurity, privacy, and financial audit. As a Senior Site Reliability Engineer, you will ensure the reliability, scalability, and observability of production systems while collaborating with engineering teams to improve system performance and drive operational excellence.
Responsibilities:
- Design and operate highly scalable, fault-tolerant systems that support production workloads across a distributed cloud environment
- Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to guide reliability decisions
- Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance
- Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning
- Automate operational processes to reduce manual toil and improve system consistency and resilience
- Partner with engineering teams to design systems with reliability and scalability built in from the start
- Participate in and improve incident response, on-call practices, and post-incident reviews, focusing on root cause analysis and systemic improvements
- Drive continuous improvement of system resilience, including disaster recovery and chaos testing
- Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues
- Advocate for a reliability-focused engineering culture, including blameless postmortems and operational excellence
Requirements:
- 5+ years of experience in site reliability engineering, infrastructure, or a related software engineering discipline
- Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred
- Hands-on experience building and managing observability platforms (e.g., Datadog, Prometheus, Grafana, CloudWatch)
- Experience defining SLOs/SLIs and using them to set engineering priorities
- Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent
- Deep understanding of system performance, reliability patterns, and distributed system failure modes
- Experience supporting production systems through on-call rotations and incident response
- Proficiency in at least one programming or scripting language used for automation and tooling
- Strong communication and collaboration skills, with the ability to work effectively across engineering and product teams
- Experience implementing distributed tracing systems, such as OpenTelemetry or similar frameworks
- Experience with capacity planning and performance benchmarking at scale
- Familiarity with database performance tuning and observability across high-traffic systems
- Exposure to regulated or compliance-heavy engineering environments (e.g., SOC 2, FedRAMP, or equivalent frameworks)
- Experience applying chaos engineering practices to proactively test and strengthen system resilience