Provision multi-tenant environments: tenant creation, log file type registration, product family configuration, severity thresholds, and API key management.
Guide customers through LogIQ's Signature Onboarding Wizard.
Configure per-tenant defaults and document every configuration decision in customer-specific runbooks for long-term maintainability.
Validate the full detection lifecycle end-to-end on customer log samples before any go-live, including quality benchmarks on hold-out data.
Set up real-time log stream ingestion pipelines — Kafka, Kinesis, Fluentd, syslog-ng, or customer-native agents — into LogIQ's streaming layer.
Configure the Anomaly Detection engine: define healthy baselines, tune sensitivity thresholds, and map deviation patterns to specific signature triggers.
Wire streaming triggers to the RCA Agent so that when an anomaly fires, root-cause investigation begins automatically with no human intervention.
Monitor stream health (lag, throughput, parsing error rates) and alert on pipeline degradation before it affects customer outcomes.
Work with customers to identify which log sources to prioritize for streaming vs. batch ingestion, balancing latency requirements against infrastructure cost.
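The lag monitoring above boils down to comparing produced offsets against committed offsets per partition. A minimal sketch of that check, with hypothetical field names and an illustrative alert threshold (none of this is a LogIQ or Kafka API):

```python
from dataclasses import dataclass

@dataclass
class PartitionHealth:
    partition: int
    end_offset: int        # latest offset produced to the partition
    committed_offset: int  # last offset the consumer group committed

    @property
    def lag(self) -> int:
        """How far the consumer trails the head of the partition."""
        return max(self.end_offset - self.committed_offset, 0)

def degraded_partitions(snapshot: list[PartitionHealth],
                        max_lag: int = 10_000) -> list[int]:
    """Return partitions whose consumer lag exceeds the alert threshold."""
    return [p.partition for p in snapshot if p.lag > max_lag]

snapshot = [
    PartitionHealth(partition=0, end_offset=50_000, committed_offset=49_900),
    PartitionHealth(partition=1, end_offset=80_000, committed_offset=62_000),
]
print(degraded_partitions(snapshot))  # [1]: partition 1 lags by 18,000
```

In practice the snapshot would come from the broker's admin API rather than hard-coded values; the point is that lag is a derived metric you alert on, not something the pipeline reports directly.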
Ingest and index customer knowledge articles, historical case resolutions, and equipment documentation into the RCA Agent's retrieval layer (OpenSearch + pgvector).
Configure evidence-weighting rules so the RCA Agent knows which sources to trust most for a given equipment type or failure mode.
Tune reasoning prompts and retrieval strategies based on observed RCA quality — iterating until root-cause accuracy meets the customer's acceptance criteria.
Build fix-strategy libraries: map known root causes to recommended remediation steps, pulling from customer SOPs and historical tickets.
Validate RCA output against historical cases where the true root cause is known; track precision and recall over iteration cycles.
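One way to picture the evidence-weighting step from this block: scale each retrieval hit's similarity score by a per-source trust weight before ranking. The source types, weights, and dict shape below are invented for illustration; the real rules live in LogIQ configuration:

```python
# Hypothetical trust weights per evidence source for a given equipment type.
SOURCE_WEIGHTS = {
    "equipment_docs": 1.0,      # vendor documentation: most authoritative
    "case_resolutions": 0.8,    # historical fixes: strong but situational
    "knowledge_articles": 0.6,  # KB articles: useful, sometimes stale
}

def rank_evidence(hits: list[dict]) -> list[dict]:
    """Re-rank retrieval hits by similarity score scaled by source trust."""
    for hit in hits:
        hit["weighted"] = hit["score"] * SOURCE_WEIGHTS.get(hit["source"], 0.5)
    return sorted(hits, key=lambda h: h["weighted"], reverse=True)

hits = [
    {"source": "knowledge_articles", "score": 0.92},
    {"source": "equipment_docs", "score": 0.75},
]
# The weaker raw match wins after weighting: 0.75 * 1.0 beats 0.92 * 0.6.
print(rank_evidence(hits)[0]["source"])  # equipment_docs
```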
Ingest, clean, and pre-label customer-provided log samples to build compelling, domain-specific demos that speak directly to the customer's operational pain.
Demonstrate both reactive (case upload → signature detection → RCA → fix recommendation) and proactive (live stream → anomaly trigger → automated RCA) workflows against real data.
Create demo scripts, scenario walkthroughs, before/after MTTR comparisons, and leave-behind documentation for prospects.
Adapt demos quickly to new industries or log types — a customer in manufacturing should see their alarm formats, their fault patterns, their fix vocabulary.
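The before/after MTTR comparison in a leave-behind is just arithmetic over case resolution times; a toy sketch with made-up durations (every number here is illustrative):

```python
from statistics import mean

def mttr_hours(resolution_minutes: list[float]) -> float:
    """Mean time to resolution, in hours."""
    return mean(resolution_minutes) / 60

before = [240, 360, 180, 300]  # minutes per case, pre-LogIQ (illustrative)
after = [45, 60, 30, 75]       # minutes per case, with automated RCA

improvement = 1 - mttr_hours(after) / mttr_hours(before)
print(f"MTTR reduced by {improvement:.0%}")  # MTTR reduced by 81%
```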
Design, build, and register new LangGraph agent tools as customer use cases demand — e.g., a tool that queries a customer's CMDB, pulls ticket history from ServiceNow, or fetches firmware changelogs from an internal API.
Package reusable capabilities as LogIQ Skills: self-contained, versioned bundles of tools, prompts, and configuration that can be applied across customers in the same domain.
Maintain a tool allowlist and review process so new tools integrate safely with the agent's execution context and tenant isolation guarantees.
Contribute high-quality tools back to the platform's shared tool library so the whole team benefits.
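The allowlist-and-registration pattern from this block can be sketched in pure Python. This is an illustration of the pattern only, not the actual LangGraph `@tool` decorator or LogIQ's registry, and the tool name is hypothetical:

```python
# Tools approved through review for this tenant (illustrative allowlist).
ALLOWLIST = {"fetch_ticket_history"}
REGISTRY: dict[str, object] = {}

def register_tool(fn):
    """Register a tool if it passed review; reject anything off-list."""
    if fn.__name__ not in ALLOWLIST:
        raise PermissionError(f"{fn.__name__} is not on the tool allowlist")
    REGISTRY[fn.__name__] = fn
    return fn

@register_tool
def fetch_ticket_history(ticket_id: str) -> dict:
    """Hypothetical tool: pull ticket history from a service desk API."""
    return {"ticket_id": ticket_id, "events": []}  # stubbed response

print(sorted(REGISTRY))  # ['fetch_ticket_history']
```

Gating registration at decoration time, rather than at call time, means a disallowed tool never enters the agent's execution context at all, which is what the tenant-isolation guarantee needs.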
Write custom log parsers for proprietary or undocumented equipment formats (Python, plugged into the FastAPI parser registry).
Build data connectors for customer-specific ingestion sources: REST APIs, SFTP drops, database exports, or cloud storage buckets.
Define record-splitting rules, type classifiers, and deep-parsed field schemas for new log file types using the Signature Onboarding pipeline.
Maintain a parser test suite — real sample lines, expected field outputs — so parsers don't regress across platform updates.
Tune LLM system prompts, memory strategies, context windows, and few-shot examples based on observed agent behavior on customer data.
Modify the signature workflow DAG to handle customer-specific detection logic that the automated agent generation doesn't cover out of the box.
Ship targeted bug fixes and feature additions back to the core platform codebase — you are a contributor, not just a consumer.
Debug async pipeline failures across workers, message brokers, and streaming consumers, working from logs and traces.
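A minimal asyncio sketch of the failure-surfacing pattern this debugging relies on: gather stage results with exceptions captured rather than swallowed, so a failing stage is logged with context instead of lost. Stage names are invented:

```python
import asyncio

async def stage(name: str, fail: bool = False) -> str:
    """Stand-in for one async pipeline stage."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} ok"

async def run_pipeline() -> list[str]:
    # return_exceptions=True keeps one failing stage from cancelling or
    # hiding the others, so every outcome can be inspected and logged.
    results = await asyncio.gather(
        stage("parse"), stage("enrich", fail=True), stage("index"),
        return_exceptions=True,
    )
    return [f"{type(r).__name__}: {r}" if isinstance(r, Exception) else r
            for r in results]

print(asyncio.run(run_pipeline()))
# ['parse ok', 'RuntimeError: enrich failed', 'index ok']
```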
Own the technical relationship for your customer portfolio: onboarding calls, weekly syncs, async Slack/email, and escalation handling.
Translate customer domain knowledge — telecom alarm semantics, SCADA event codes, IT operations terminology — into LogIQ configuration and agent guidance.
Train customer teams to operate LogIQ independently: run their own demos, onboard new signatures, and interpret RCA outputs.
Surface recurring pain points and propose product improvements; your customer exposure gives you signal the core product team cannot get from anywhere else.
Requirements
4+ years of production Python. Comfortable with asyncio, FastAPI, Pydantic v2, and SQLAlchemy 2.0. Ability to read and extend an unfamiliar codebase quickly.
Hands-on experience building or operating LLM-powered agent pipelines — LangChain, LangGraph, CrewAI, AutoGen, or equivalent. Understands state graphs, tool calls, memory, and multi-step reasoning loops.
Can design, implement, and register new agent tools using the @tool decorator pattern (LangGraph/LangChain). Understands tool allowlists, input/output schemas, and safe integration with existing agent contexts.
Can systematically diagnose LLM failure modes and improve prompts through controlled iteration. Understands token budgeting, few-shot construction, output format control, and context window management.
Working knowledge of at least one streaming or log-shipping technology — Kafka, Kinesis, Fluentd, Logstash, syslog-ng, or similar. Understands consumer lag, backpressure, and at-least-once delivery semantics.
Understands async task queues (Celery, SQS, Redis), message broker patterns, and how to debug distributed pipeline failures from logs and traces.
Solid PostgreSQL fundamentals: schema design, JSONB queries, indexing. Exposure to time-series stores (TimescaleDB) and full-text search (OpenSearch / Elasticsearch) is a plus.
Comfortable with AWS (S3, SQS, IAM, Kinesis) or Azure equivalents. Docker and container-based local deployments. Familiarity with docker-compose for multi-service dev environments.
Strong written and spoken English. Can explain a multi-stage agent failure to a non-technical operations director. Experience in customer-facing technical roles — solutions engineering, implementation, pre-sales, or technical consulting — is a strong plus.
Education: B.E. / B.Tech or M.Tech in Computer Science, Electronics, or a related engineering discipline. Equivalent industry experience is fully acceptable.