TechClub Inc is seeking a highly skilled Site Reliability Engineer (SRE) to ensure the health, availability, and performance of its enterprise platform. The role involves leading reliability engineering practices, managing infrastructure deployment, and implementing observability solutions in collaboration with development, architecture, infrastructure, and security teams.

Responsibilities:

Own the end‑to‑end health, uptime, performance, and reliability of the platform across cloud (Azure) and on‑prem environments
Ensure resilience across application layers: .NET, Java, React.js, Microservices, and backend systems such as SQL Server and Kafka
Lead incident management, root cause analysis, and post‑incident reviews with a focus on continuous improvement
Design, implement, and maintain cloud and on‑prem infrastructure using Terraform (IaC)
Own and optimize CI/CD pipelines for infrastructure and applications in GitHub Actions and Azure DevOps
Improve deployment automation, reliability, and release processes across all teams
Implement and enhance monitoring, alerting, dashboards, and analytics using Dynatrace (APM, RUM, synthetic monitoring, logs, metrics) and Splunk (log search, correlation, alerting)
Build proactive monitoring workflows to detect issues before they impact customers
Own SRE metrics such as SLOs, SLIs, error budgets, MTTR, MTBF, availability KPIs, and system productivity metrics
Tune the performance of database and application services
Ensure all platform and application security vulnerabilities are identified and remediated on time
Partner with cybersecurity to ensure compliance with enterprise standards and policies
Automate security scans and integrate them into CI/CD pipelines
Conduct performance analysis, load testing, and tuning across microservices, SQL Server databases, Kafka clusters, and front‑end React.js applications
Partner with engineering teams to design scalable, reliable system architectures
Collaborate with development, architecture, infrastructure, and security teams
Advocate for SRE and DevOps culture: automation, reliability engineering, and blameless postmortems
Mentor developers and engineers on reliability best practices and tools

Requirements:

5+ years of experience in SRE, DevOps, or Platform Engineering roles
Strong expertise in SQL Server administration and performance tuning
Strong expertise in .NET, Java, and microservices architectures
Strong expertise in React.js fundamentals
Hands‑on experience with Azure Cloud services (VMs, AKS, App Services, Networking)
Hands‑on experience with On‑prem servers and hybrid integrations
Hands‑on experience with Terraform (writing, testing, maintaining modules)
Hands‑on experience with CI/CD using GitHub Actions and Azure DevOps
Proficiency with observability tools: Dynatrace (preferred), Splunk
Experience with Kafka (producers, consumers, performance tuning)
Strong understanding of SRE fundamentals: SLO/SLI design, error budgets, distributed systems concepts, and incident response
Experience with containerization and Kubernetes (AKS or on‑prem K8s)
Experience with service mesh, API gateway technologies, or event‑driven architectures
Knowledge of secure coding practices and integrating security in CI/CD
Familiarity with enterprise networking, firewalls, and hybrid connectivity
Strong communication and collaboration abilities
Analytical mindset with strong problem‑solving skills
Ability to handle pressure in high‑severity incidents
Passion for automation, simplification, and continuous improvement
