Inspira Financial is a company focused on helping businesses and individuals thrive in their health and wealth journeys. The Reliability Engineer will work closely with engineering, security, and infrastructure teams to ensure system availability, scalability, and security, while also contributing to incident response and performance optimization.
Responsibilities:
- Partner with the Engineering and Security teams to create, implement and apply SRE principles, processes, and controls
- Build and support the Site Reliability function, and participate in building tools to monitor and report system KPIs
- Monitor the platform and environment with tools such as Datadog and Azure Monitor
- Configure and Support the Disaster Recovery and Business Resumption Plan as it relates to the backup and restoration of the technology infrastructure
- Ensure runbooks are updated on a regular basis
- Utilize programming skills to design and develop programs or scripts for various repetitive functions
- Contribute to long-term infrastructure strategies and reliability improvements
- Perform all duties with a focus on Inspira's goals, which include risk mitigation
- Support inbound calls and emails, and maintain tickets related to Infrastructure Support in the issue tracking application
- Cross-train other team members to facilitate coverage
- Other duties as assigned
Requirements:
- Bachelor's degree in computer science or equivalent experience
- Minimum 3 years of experience in Information Technology
- 3+ years of role-specific experience
- Minimum of 3 years of experience with IaC tools such as Terraform and with Bash scripting
- Experience supporting containerization platforms such as Kubernetes (K8s) and Docker
- Experience working with automation tools such as Azure DevOps (ADO), Jenkins, and Chef
- Experience working with Observability tools such as Datadog and Azure Monitor
- Knowledge of principles such as SLIs, SLOs, and error budgets
- Familiarity with observability concepts beyond monitoring, such as distributed tracing and log correlation
- Knowledge of Virtual Machines and Container concepts
- Knowledge of security as it relates to cloud environments, including the shared responsibility model
- Experience with scripting languages such as PowerShell, Bash, and Python
- Experience with cloud services: Azure (preferred), Google Cloud, or AWS
- Experience with backup and disaster recovery (BDR) solutions such as Veeam, VMware Site Recovery, and Azure Backup/Site Recovery
- Ability to work independently with minimal supervision
- Must have excellent written and verbal communication skills
- Strong analytical skills, follow-up capability, and problem-solving ability
- Ability to conduct research into hardware and software issues and products as required
- Ability to effectively prioritize and execute tasks in a high-pressure environment
- Ability to use strong interpersonal and presentation skills to share ideas and solutions and to build strong working relationships with business units, including non-technical users, technical leads, and developers
- Experience working with a ticketing system and internal clients
- Ability to respond to emails and text messages after hours to resolve critical issues
- Must possess strong skills in personal diplomacy and client service while consistently demonstrating a high level of motivation, commitment to teamwork, professionalism and trustworthiness
- Strong vendor management skills
- Highly self-motivated and directed
- Experience in a high-availability environment preferred
- Knowledge of ITIL/ITSM practices and framework preferred
- Infrequent travel
- Ability to provide personal transportation from time to time
- Ability to work overtime
- Ability to sit at a desk and work on a computer for prolonged periods
- Certifications preferred: AZ-900, Datadog Fundamentals