Applied Systems is a company dedicated to transforming the insurance industry by delivering innovative software and services. They are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and efficiency of their software applications, playing a critical role in delivering best-in-class services to customers.

Responsibilities:

Develop and maintain infrastructure as code (IaC) using Terraform, Terraform CDK with TypeScript, Packer, and Ansible to automate on-prem and cloud infrastructure provisioning and management
Collaborate with development and platform teams to design scalable, reliable systems with fault tolerance, high availability, and performance optimization
Implement and manage monitoring solutions using Datadog, including tracing instrumentation, to ensure system performance and adherence to SLIs, SLOs, and SLAs
Utilize HashiCorp Consul for service discovery, dynamic configuration, and network automation across distributed systems
Define and implement best practices for disaster recovery and high availability across hybrid environments
Build and maintain CI/CD pipelines using tools like GitLab and GitHub Actions to streamline deployments and ensure code quality
Automate repetitive tasks to increase efficiency and reduce human error, leveraging tools like Python, Go, Bash, and PowerShell
Manage Kubernetes environments, including Helm charts and ArgoCD for application deployment and orchestration
Mentor junior engineers, lead technical discussions, and collaborate across teams to drive consensus on design decisions and technical initiatives
Create and maintain accurate documentation for workflows, procedures, and infrastructure standards to support internal teams and customers
Participate in the on-call rotation to provide production support and resolve complex engineering challenges
Work with third-party vendors to evaluate and integrate their products and services into the infrastructure ecosystem

Requirements:

5+ years of experience in DevOps, SRE, or Infrastructure Engineering roles
Strong foundation in incident management, troubleshooting, and observability of software applications
Experience with cloud platforms (GCP, AWS, Azure), including traffic management solutions
Familiarity with distributed systems, microservices architecture, and related technologies
Proficiency in Python, Go, Bash, and PowerShell
Expertise in Windows and Linux system administration
Advanced knowledge of IaC tools such as Terraform (HCL and Terraform CDK with TypeScript) and Packer
Knowledge of CI/CD pipelines and version control platforms (GitLab, GitHub Actions, etc.)
Familiarity with monitoring tools (Datadog) and security solutions (HashiCorp Vault, Cloud Armor)
Experience with SQL Server and PostgreSQL for database management
Kubernetes expertise, including Helm charts and ArgoCD for application deployment and orchestration
Excellent communication skills to collaborate with engineers, product managers, and business stakeholders
Strong organizational skills and attention to detail
Ability to prioritize tasks and make accurate decisions under pressure
Passion for mentoring and guiding team members
