QData Inc is seeking a Senior Lead Site Reliability Engineer (SRE) to drive reliability, scalability, performance, and operational excellence across its cloud-native platforms. The ideal candidate will manage and support mission-critical applications, define and monitor service level indicators, and build automation for incident management and operational workflows.

Responsibilities:

Lead SRE initiatives focused on system reliability, availability, scalability, and performance
Manage and support mission-critical Java, Kafka, and Node.js applications in production environments
Define and monitor SLIs, SLOs, and error budgets
Build automation for incident management, deployments, remediation, and operational workflows
Drive root cause analysis (RCA) and implement preventive measures
Design and implement observability solutions including metrics, logging, tracing, and alerting
Collaborate with development, DevOps, cloud, and security teams to improve platform resilience
Support Kafka clusters, event-streaming infrastructure, and distributed systems
Lead production readiness reviews and capacity planning exercises
Participate in on-call rotations and major incident management activities

Requirements:

10+ years of experience in Site Reliability Engineering, Production Support, DevOps, or Platform Engineering
Strong experience supporting Java/Spring Boot applications in production
Hands-on experience with Node.js applications and microservices
Deep knowledge of Apache Kafka administration, monitoring, troubleshooting, and performance tuning
Experience with Kubernetes, Docker, and container orchestration
Strong cloud experience in AWS, Azure, or GCP
Expertise with Infrastructure as Code (Terraform, CloudFormation, etc.)
Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or Azure DevOps
Strong Linux and scripting skills (Python, Bash, or other shell scripting)
Experience with monitoring and observability tools: Prometheus, Grafana, Datadog, Splunk, ELK Stack, Dynatrace, New Relic
Understanding of Generative AI and Large Language Models (LLMs)
Experience leveraging AI-powered monitoring and incident management tools
Knowledge of AIOps platforms for anomaly detection and predictive alerting
Familiarity with tools such as OpenAI APIs, GitHub Copilot, and AI-assisted operational workflows
Ability to identify automation opportunities using AI technologies
Experience in high-volume distributed systems and event-driven architectures
Strong troubleshooting skills across application, infrastructure, network, and database layers
Experience supporting large-scale Kafka environments
Exposure to ServiceNow, PagerDuty, OpsGenie, or similar incident management platforms
Financial Services, Banking, Retail, Healthcare, or SaaS industry experience

Senior Lead Site Reliability Engineer (SRE) – Java, Kafka, Node.js & AI Awareness
