ThunderYard Solutions is seeking a Site Reliability Engineer to support the infrastructure of the Department of Veterans Affairs. The role involves building, monitoring, and maintaining secure systems while collaborating with senior engineers to improve system reliability and automation in a hybrid on-premises and cloud environment.
Responsibilities:
- Support the installation, configuration, and automation of Linux-based infrastructure in a hybrid on-premises and cloud environment
- Collaborate with senior engineers to implement infrastructure as code (IaC)
- Manage containerized workloads and improve system reliability through observability, performance tuning, and incident response
- Monitor critical services and respond to system alerts
- Document runbooks and standard operating procedures to ensure resilient and repeatable operations
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent work experience
- 5+ years of hands-on experience in system administration, SRE, or DevOps roles
- Experience supporting Linux-based systems in production environments
- Proficiency with scripting languages (e.g., Bash, Python, or PowerShell)
- Familiarity with monitoring tools (e.g., Dynatrace, Prometheus, Grafana, ELK, or CloudWatch)
- Experience with CI/CD pipelines and IaC tools such as Terraform, Ansible, or CloudFormation
- Understanding of cloud platforms (e.g., AWS, Azure, GCP) and hybrid environments
- Prior experience supporting federal agencies, particularly the Department of Veterans Affairs or other health-related agencies (e.g., HHS)
- Knowledge of federal compliance standards such as FedRAMP, DISA STIGs, FISMA, or NIST 800-53
- Familiarity with containerization and orchestration (e.g., Docker, Kubernetes)
- Experience with incident response and postmortem analysis
- Understanding of networking concepts, firewalls, and VPNs
- Strong troubleshooting skills and a proactive approach to identifying and resolving reliability issues
- Ability to write clear, concise documentation and standard operating procedures