Site Reliability Engineer, Artificial Intelligence Engineer at Leidos | JobVerse
Leidos
Remote
United States
Full Time
5 days ago
$131,300 - $237,350 USD
No H-1B sponsorship
Key skills
Distributed Systems
AI
Machine Learning
ML
NLP
Analytics
Leadership
About this role
Role Overview
Design, develop, and maintain AI/ML models for anomaly detection, trend analysis, and signal correlation across metrics, logs, traces, and events.
Reduce alert noise through intelligent alert grouping, suppression, and prioritization.
Enhance observability platforms with AI-generated insights supporting SLO and error-budget management.
Implement AI-driven incident classification, enrichment, and summarization.
Provide probable root-cause analysis recommendations based on historical and real-time telemetry.
Support on-call and incident response teams with AI-guided remediation suggestions.
Contribute AI insights to post-incident reviews and reliability improvement plans.
Apply AI techniques to identify repetitive operational tasks and automation opportunities.
Assist in generating, validating, and optimizing automation playbooks and workflows.
Analyze automation execution data to improve success rates, resiliency, and reuse.
Build and maintain AI-searchable knowledge repositories containing runbooks, SOPs, lessons learned, and historical incident data.
Enable natural-language access to operational knowledge for SREs and operations staff.
Develop predictive models for capacity planning, failure forecasting, configuration risk, and reliability debt identification.
Support proactive remediation strategies to prevent incidents before customer impact.
Assist SRE leadership in data-driven prioritization of reliability investments.
Ensure AI solutions adhere to organizational security, compliance, and data-handling policies.
Establish guardrails for AI recommendations and automation execution.
Promote transparency, explainability, and auditability of AI-driven operational decisions.
Requirements
Bachelor’s degree in Computer Science, Engineering, Information Systems, Data Science, or a related discipline
5+ years of experience in Site Reliability Engineering, DevOps, IT Operations, or Systems Engineering
2+ years applying AI/ML techniques in operational, analytics, or automation contexts
Demonstrated experience supporting production systems in high-availability environments
Must hold an active Secret clearance to be considered for this position
Proficiency in data analysis tooling
Experience with machine learning fundamentals (anomaly detection, clustering, time-series analysis, NLP)
Familiarity with observability platforms (metrics, logs, traces, events)
Experience with automation frameworks and infrastructure-as-code concepts
Strong understanding of distributed systems and operational telemetry
Tech Stack
Distributed Systems
Benefits
Competitive compensation
Health and Wellness programs
Income Protection
Paid Leave
Retirement