Design, develop, and operate automated, resilient, highly available, self-healing, secure platforms with native AI capabilities for IT needs, serving both internal and customer business capabilities.
Develop and manage the Observability OpenTelemetry Central Backend Stack: Grafana Enterprise, Mimir, Loki, Tempo, and Alertmanager on Kubernetes/RKE2 via Helm and GitLab CI/CD.
Build and manage IaC and CI/CD for automated provisioning and deployment, including Terraform modules for infrastructure/VM/storage provisioning, Ansible AWX playbooks for OS/app bootstrap, and ArgoCD and Helm for Kubernetes configuration.
Develop and manage the OpenTelemetry/Prometheus scrape profile library, including SNMP exporters, REST API exporters, and cloud provider exporters (CloudWatch, Azure Monitor, GCP) for multiple device classes.
Develop AIOps capabilities on the platforms, e.g. for Observability use cases: anomaly detection integrations, event correlation rules in Alertmanager, and synthetic monitoring patterns to reduce alert noise.
Configure and maintain Zabbix auto-discovery: network range scanning, device classification, and Prometheus service discovery integration.
Build and harden Edge Stack deployments (Prometheus + OTel collector) per data center site using GitOps templates.
Integrate Alertmanager with ServiceNow: webhook routing, ticket enrichment, auto-close logic, and escalation policy configuration.
Maintain platform security: Conjur/CyberArk secret injection at runtime, mTLS between stack components, and RBAC in Grafana Enterprise.
Author and maintain Grafana dashboards in JSON/GitLab: facility overview, network health, RED metrics, and application telemetry.
Mentor mid-level engineers, lead code reviews, and establish engineering standards for the team.
Represent platform engineering in cross-functional architecture reviews and executive-level program updates.
Perform other duties as required and assigned.
Requirements
DevOps / Automation
5+ years in a production environment
Kubernetes (RKE2/k3s), Helm chart deployment, system services, Docker/containers
LGTM Stack Development and Configuration
4+ years: Grafana, Mimir, Loki, Tempo configuration, tuning, dashboarding, and production operations; Prometheus required
Senior-level Python / Scripting frameworks
5+ years, automation scripts, exporter development, GitLab pipeline scripting, REST API integrations
GitOps / CI/CD
5+ years, GitLab CI/CD pipeline authoring; Terraform and Ansible as primary IaC tools; ArgoCD or Flux preferred