CertifID is dedicated to enhancing security and fighting fraud in the real estate sector. They are seeking a Senior Site Reliability Engineer to drive reliability improvements across their production SaaS environment, focusing on building scalable infrastructure, improving incident response, and mentoring junior engineers.
Responsibilities:
- Own and improve the reliability, availability, and performance of production systems while defining and operationalizing SLIs/SLOs and error budgets
- Design and implement autonomous and semi-autonomous AI agents that monitor distributed systems and applications by consuming multi-source observability data (metrics, logs, traces, etc.)
- Participate in and help lead an on-call rotation, serving as an escalation point for major incidents and facilitating blameless postmortems
- Build automated workflows to eliminate manual work, and design and maintain Infrastructure-as-Code with Terraform
- Improve metrics, logs, traces, and alerting using tools like Datadog or Prometheus to reduce noise and increase signal
- Partner with application teams to implement reliability best practices and mentor junior engineers to foster a culture of knowledge sharing
Requirements:
- 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering
- Proven experience supporting production SaaS systems in Azure (preferred), AWS, or GCP
- Strong Linux, networking, and distributed systems troubleshooting skills
- Strong experience with containers and orchestration (Kubernetes, including managed services such as EKS or AKS)
- Expertise with Infrastructure-as-Code (Terraform strongly preferred)
- Strong scripting/programming skills in Python, Go, Bash, or C#/.NET
- Hands-on experience with Datadog, Prometheus/Grafana, or OpenTelemetry