Leidos provides engineering support to the U.S. Navy’s Service Management, Integration, and Transport program. The AI Reliability Engineer (AI-SRE) integrates AI and machine learning capabilities into SRE operations to improve system reliability and operational efficiency, and works with SRE teams to turn operational data into actionable insights.

Responsibilities:

Design, develop, and maintain AI/ML models for anomaly detection, trend analysis, and signal correlation across metrics, logs, traces, and events
Reduce alert noise through intelligent alert grouping, suppression, and prioritization
Enhance observability platforms with AI-generated insights supporting SLO and error-budget management
Implement AI-driven incident classification, enrichment, and summarization
Provide probable root-cause analysis recommendations based on historical and real-time telemetry
Support on-call and incident response teams with AI-guided remediation suggestions
Contribute AI insights to post-incident reviews and reliability improvement plans
Apply AI techniques to identify repetitive operational tasks and automation opportunities
Assist in generating, validating, and optimizing automation playbooks and workflows
Analyze automation execution data to improve success rates, resiliency, and reuse
Build and maintain AI-searchable knowledge repositories containing runbooks, SOPs, lessons learned, and historical incident data
Enable natural-language access to operational knowledge for SREs and operations staff
Reduce dependency on tribal knowledge through intelligent documentation and discovery
Develop predictive models for capacity planning, failure forecasting, configuration risk, and reliability debt identification
Support proactive remediation strategies to prevent incidents before customer impact
Assist SRE leadership in data-driven prioritization of reliability investments
Ensure AI solutions adhere to organizational security, compliance, and data-handling policies
Establish guardrails for AI recommendations, human-in-the-loop decision making, and automation execution
Promote transparency, explainability, and auditability of AI-driven operational decisions

Requirements:

Bachelor's degree in Computer Science, Engineering, Information Systems, Data Science, or a related discipline
5+ years in Site Reliability Engineering, DevOps, IT Operations, or Systems Engineering
2+ years applying AI/ML techniques in operational, analytics, or automation contexts
Demonstrated experience supporting production systems in high-availability environments
Must have an active Secret clearance to be considered for the position
Proficiency in data analysis tooling
Experience with machine learning fundamentals (anomaly detection, clustering, time-series analysis, NLP)
Familiarity with observability platforms (metrics, logs, traces, events)
Experience with automation frameworks and infrastructure-as-code concepts
Strong understanding of distributed systems and operational telemetry

Site Reliability Engineer (SRE) Artificial Intelligence (AI) Engineer
