CertifID enhances security in the real estate sector with a platform that verifies identities and authenticates wire transfer instructions. The company is seeking a Senior Site Reliability Engineer to drive reliability improvements in its production SaaS environment, with a focus on building scalable infrastructure, improving incident response, and mentoring junior engineers.

Responsibilities:

Own and improve the reliability, availability, and performance of production systems while defining and operationalizing SLIs/SLOs and error budgets
Design and implement autonomous and semi-autonomous AI agents for monitoring distributed systems and applications
Build agents capable of consuming multi-source observability data (metrics, logs, traces, etc.)
Participate in and help lead an on-call rotation, serving as an escalation point for major incidents and facilitating blameless postmortems
Build automated workflows to eliminate manual work, and design and maintain Infrastructure-as-Code with Terraform
Improve metrics, logs, traces, and alerting using tools like Datadog or Prometheus to reduce noise and increase signal
Partner with application teams to implement reliability best practices and mentor junior engineers to foster a culture of knowledge sharing

Requirements:

5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering
Proven experience supporting production SaaS systems in Azure (preferred), AWS, or GCP
Strong Linux, networking, and distributed systems troubleshooting skills
Strong experience with containers and orchestration (Kubernetes/EKS/AKS)
Expertise with Infrastructure-as-Code (Terraform strongly preferred)
Strong scripting/programming skills in Python, Go, Bash, or C#/.NET
Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry
