Applied Systems is a company dedicated to transforming the insurance industry by delivering innovative software and services. They are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and efficiency of their software applications, playing a critical role in delivering best-in-class services to customers.
Responsibilities:
- Develop and maintain infrastructure as code (IaC) using Terraform, Terraform CDK with TypeScript, Packer, and Ansible to automate provisioning and management of on-premises and cloud infrastructure
- Collaborate with development and platform teams to design scalable, reliable systems with fault tolerance, high availability, and performance optimization
- Implement and manage monitoring and tracing instrumentation using Datadog to track system performance and adherence to SLIs, SLOs, and SLAs
- Utilize HashiCorp Consul for service discovery, dynamic configuration, and network automation across distributed systems
- Define and implement best practices for disaster recovery and high availability across hybrid environments
- Build and maintain CI/CD pipelines using tools like GitLab and GitHub Actions to streamline deployments and ensure code quality
- Automate repetitive tasks to increase efficiency and reduce human error, leveraging tools like Python, Go, Bash, and PowerShell
- Manage Kubernetes environments, including Helm charts and ArgoCD for application deployment and orchestration
- Mentor junior engineers, lead technical discussions, and collaborate across teams to drive consensus on design decisions and technical initiatives
- Create and maintain accurate documentation for workflows, procedures, and infrastructure standards to support internal teams and customers
- Participate in the on-call rotation to provide production support and resolve complex engineering challenges
- Work with third-party vendors to evaluate and integrate their products and services into the infrastructure ecosystem
Requirements:
- 5+ years of experience in DevOps, SRE, or Infrastructure Engineering roles
- Strong foundation in incident management, troubleshooting, and observability of software applications
- Experience with cloud platforms (GCP, AWS, Azure), including traffic management solutions
- Familiarity with distributed systems, microservices architecture, and related technologies
- Proficiency in Python, Go, Bash, and PowerShell
- Expertise in Windows and Linux system administration
- Advanced knowledge of IaC tools such as Terraform (HCL and Terraform CDK with TypeScript) and Packer
- Knowledge of CI/CD pipelines and version control systems (GitLab, GitHub Actions, etc.)
- Familiarity with monitoring tools (Datadog) and security solutions (HashiCorp Vault, Cloud Armor)
- Experience with SQL Server and PostgreSQL for database management
- Kubernetes expertise, including Helm charts and ArgoCD for application deployment and orchestration
- Excellent communication skills to collaborate with engineers, product managers, and business stakeholders
- Strong organizational skills and attention to detail
- Ability to prioritize tasks and make sound decisions under pressure
- Passion for mentoring and guiding team members