Act as Incident Commander during major-severity incidents affecting payments, trading, or compliance systems: coordinate cross-functional response, provide clear status updates, and drive post-mortems.
Design and implement observability strategies using Grafana, Sentry, and CloudWatch.
Instrument Go services to expose high-cardinality metrics and distributed traces.
Collaboratively define, measure, and defend Service Level Objectives (SLOs) and Error Budgets with product and engineering teams.
Write production-ready code to build internal tooling, automation platforms, and self-healing mechanisms that eliminate manual operator intervention.
Contribute reliability patterns (circuit breakers, retries, backpressure) directly to backend services.
Partner with backend engineering teams during the design phase to ensure new services are built with reliability, scalability, and observability patterns from day one.
Analyze system performance and traffic patterns to model future capacity needs.
Conduct load testing and chaos engineering experiments to verify system resilience under failure conditions, particularly for financial transactions and compliance workflows.

At least 4 years of experience in SRE or Backend Engineering, with solid proficiency in Go.
You can read, write, and review production Go code, not just deploy it.
Deep understanding of distributed systems architecture and design patterns.
Strong command of microservices fundamentals, event-driven architectures, and the underlying principles required to build systems that scale.
Hands-on experience with AWS (ECS, RDS, CloudWatch, Lambda) or GCP, and infrastructure as code.
Proficiency in running production workloads and troubleshooting infrastructure issues.
Experience designing and implementing observability strategies using Prometheus, Grafana, OpenTelemetry, or similar tools.
Ability to instrument code for proper monitoring and alerting.
Familiarity with operating and tuning production data stores (PostgreSQL, ClickHouse) and streaming platforms (RabbitMQ, Kafka) in high-throughput environments.

Site Reliability Engineer – Infrastructure

Key skills