Title: AI/ML Engineer (Observability & Dashboard Creation)
Location: Hybrid, onsite in Dallas, TX 75019 or Tampa, FL 33647
Type: Contract to Hire
Job Description:
Overview
We are seeking a passionate, hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. In this role, you will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.
You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.
________________________________________
Key Responsibilities
• Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
• Build and integrate AI‐enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
• Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
• Implement self‐healing automation using AI‐driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
• Engineer and maintain real‐time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
• Implement and manage OpenTelemetry‐based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
• Build asynchronous Python APIs and services for model inferencing and operational integration.
• Enhance observability intelligence with AI-powered capabilities such as root‐cause acceleration, chatbot/search enablement, and automated insights.
• Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
• Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.
Required Skills & Qualifications
Core Technical Skills
• Strong proficiency in Python and data science/ML libraries: NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
• Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
• Expertise in developing and deploying ML models in production (batch & streaming).
• Strong understanding of statistics, time series modeling, and anomaly detection.
Observability & Telemetry
• Experience with OpenTelemetry for logs, metrics, traces, and spans.
• Familiarity with Observability concepts: Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
• Experience with Observability tools such as Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, and NetBrain.