Leidos, a company focused on innovative solutions, is seeking a Site Reliability Engineer to join its team. The role involves developing reusable solutions, automating infrastructure, and collaborating with Agile software teams to enhance CI/CD processes and deliver software efficiently.

Responsibilities:

Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding for a microservice enterprise system (cloud and on-premises)
Partner with development teams to improve services through rigorous testing and release procedures
Create sustainable systems and services through service automation
Manage on-premises and private/public cloud environments via infrastructure-as-code (IaC)
Enable the continuous integration and continuous delivery of our diverse suite of software products by applying best practices for infrastructure provisioning, configuration and automated software deployments
Continually evaluate fielded system deployments and apply best practices to facilitate continuous improvement that can be applied across teams
Work closely with engineering to help develop the best technical design and approach for new product installation and field service activities (software patches, cyber updates, etc.)
Develop solutions to complex technical issues and problems that impact multiple areas or disciplines
Communicate with internal team members across multiple areas and coordinate completion of key deliverables across teams
Mentor other SREs in the art of deploying and maintaining mission-critical production microservice enterprise systems
Resolve roadblocks for the field service team, working collaboratively with product engineering, technical leadership, and others. This may include participation in on-call rotations

Requirements:

Bachelor's degree in computer science or computer engineering with 4+ years of experience in a relevant field
Experience delivering entire projects or processes spanning multiple technical areas
Experience serving as a technical lead managing large projects or processes
Working knowledge of Agile Development and continuous integration and continuous delivery methodologies and tools
Expertise with Linux and Windows operating systems, network administration, and networking protocols/functions (e.g., HTTP, HTTPS, SSL/TLS, SMTP, DNS)
Expertise provisioning and managing resources within IaaS/Cloud infrastructures (e.g., Azure, AWS, Google Cloud Platform)
Experience with Terraform, Ansible, Helm, Bash scripting, CloudFormation, Chef, Puppet, or similar technologies
Expertise with container technologies such as Docker and container orchestration tools like Kubernetes
Expertise with Kubernetes kubectl
Expertise with a version control system (e.g., Git)
Strong, self-motivated desire to learn new tools, frameworks, and techniques
Ability to complete tasks independently with minimal direct supervision
Ability to work and collaborate effectively within a multi-disciplined engineering team
Ability to obtain Public Trust access
Experience with enterprise event broker technologies (Kafka, NATS)
Experience with monitoring and alerting tools such as Grafana and Prometheus
Experience with API gateways such as Istio
Experience with GitOps tools such as Argo CD, Flux CD, Fleet or similar
Professional cybersecurity certification such as Security+ or similar
Knowledge of Agile Development methodologies
Familiarity with at least one Relational Database Management System (Oracle, MySQL, PostgreSQL, SQL Server, etc.)
