Leidos is a company focused on innovative solutions, and it is seeking a Site Reliability Engineer to join its team. The role involves developing reusable solutions, automating infrastructure, and collaborating with Agile software teams to enhance CI/CD processes and deliver software efficiently.
Responsibilities:
- Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding of a microservice enterprise system (cloud and on-premises)
- Partner with development teams to improve services through rigorous testing and release procedures
- Create sustainable systems and services through service automation
- Manage on-premises and private/public cloud environments via infrastructure-as-code (IaC)
- Enable the continuous integration and continuous delivery of our diverse suite of software products by applying best practices for infrastructure provisioning, configuration and automated software deployments
- Continually evaluate fielded system deployments and apply best practices to facilitate continuous improvement that can be applied across teams
- Work closely with engineering to help develop the best technical design and approach for new product installation and field service activities (software patches, cyber updates, etc.)
- Develop solutions to complex technical issues and problems that impact multiple areas or disciplines
- Communicate with internal team members across multiple areas and coordinate completion of key deliverables across teams
- Mentor other SREs in the art of deploying and maintaining production, mission-critical microservice enterprise systems
- Resolve roadblocks for the field service team, working collaboratively with product engineering, technical leadership, and others. This may include participation in on-call rotations
Requirements:
- Bachelor's degree in computer science or computer engineering with 4+ years of experience in a relevant field
- Experience delivering entire projects or processes spanning multiple technical areas
- Experience serving as a technical lead managing large projects or processes
- Working knowledge of Agile Development and continuous integration and continuous delivery methodologies and tools
- Expertise with Linux and Windows operating systems, network administration, and networking protocols/functions (e.g., HTTP, HTTPS, SSL/TLS, SMTP, DNS)
- Expertise provisioning and managing resources within IaaS/Cloud infrastructures (e.g., Azure, AWS, Google Cloud Platform)
- Experience with Terraform, Ansible, Helm, Bash scripting, CloudFormation, Chef, Puppet, or similar technologies
- Expertise with container technologies such as Docker and container orchestration tools like Kubernetes
- Expertise with Kubernetes kubectl
- Expertise with a version control system (e.g., Git)
- Strong, self-motivated desire to learn new tools, frameworks, and techniques
- Ability to complete tasks independently with minimal direct supervision
- Ability to work and collaborate effectively within a multi-disciplined engineering team
- Ability to obtain Public Trust access
- Experience with enterprise event broker technologies (e.g., Kafka, NATS)
- Experience with monitoring and alerting tools such as Grafana and Prometheus
- Experience with API gateways such as Istio
- Experience with GitOps tools such as Argo CD, Flux CD, Fleet or similar
- Professional cybersecurity certification such as Security+ or similar
- Knowledge of Agile Development methodologies
- Familiarity with at least one Relational Database Management System (Oracle, MySQL, PostgreSQL, SQL Server, etc.)