QData Inc is seeking a Senior Lead Site Reliability Engineer (SRE) to drive reliability, scalability, performance, and operational excellence across its cloud-native platforms. The ideal candidate will manage and support mission-critical applications, define and monitor service level indicators, and build automation for incident management and operational workflows.
Responsibilities:
- Lead SRE initiatives focused on system reliability, availability, scalability, and performance
- Manage and support mission-critical Java, Kafka, and Node.js applications in production environments
- Define and monitor SLIs, SLOs, and error budgets
- Build automation for incident management, deployments, remediation, and operational workflows
- Drive root cause analysis (RCA) and implement preventive measures
- Design and implement observability solutions including metrics, logging, tracing, and alerting
- Collaborate with development, DevOps, cloud, and security teams to improve platform resilience
- Support Kafka clusters, event-streaming infrastructure, and distributed systems
- Lead production readiness reviews and capacity planning exercises
- Participate in on-call rotations and major incident management activities
Requirements:
- 10+ years of experience in Site Reliability Engineering, Production Support, DevOps, or Platform Engineering
- Strong experience supporting Java/Spring Boot applications in production
- Hands-on experience with Node.js applications and microservices
- Deep knowledge of Apache Kafka administration, monitoring, troubleshooting, and performance tuning
- Experience with Kubernetes, Docker, and container orchestration
- Strong cloud experience in AWS, Azure, or GCP
- Expertise with Infrastructure as Code (Terraform, CloudFormation, etc.)
- Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or Azure DevOps
- Strong Linux and scripting skills (Python, Bash, or other shell scripting)
- Experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, Splunk, ELK Stack, Dynatrace, or New Relic
- Understanding of Generative AI and Large Language Models (LLMs)
- Experience leveraging AI-powered monitoring and incident management tools
- Knowledge of AIOps platforms for anomaly detection and predictive alerting
- Familiarity with tools such as OpenAI APIs, GitHub Copilot, and AI-assisted operational workflows
- Ability to identify automation opportunities using AI technologies
- Experience in high-volume distributed systems and event-driven architectures
- Strong troubleshooting skills across application, infrastructure, network, and database layers
- Experience supporting large-scale Kafka environments
- Exposure to ServiceNow, PagerDuty, OpsGenie, or similar incident management platforms
- Industry experience in financial services, banking, retail, healthcare, or SaaS