Amerit Consulting is a fast-growing staffing and consulting firm that provides services to Fortune 500 companies. The firm is seeking a Senior Site Reliability Engineer to operate, stabilize, and improve highly available systems while driving AI-based reliability automation across services.

Responsibilities:

Design, operate, and support scalable, observable production systems in Azure
Participate in and lead on-call rotations and high-severity incident response
Conduct root cause analysis, blameless post-incident reviews, and implement corrective actions
Own and enhance observability using Dynatrace (dashboards, alerts, SLIs/SLOs)
Troubleshoot production issues across Java-based services, Kubernetes, and cloud infrastructure
Collaborate with cross-functional teams to reduce risk and operational toil
Design and build AI-driven automation for incident ingestion, triage, investigation, and remediation using multi-agent patterns
Develop automation for incident communication, reporting, and continuous improvement
Remain accountable for system reliability and AI-driven operations in production
Combine software engineering and systems expertise to automate workflows, improve performance, and enhance system resilience
Develop CI/CD pipelines, manage infrastructure, improve monitoring and observability, and support Java applications in production environments
Define reliability standards, influence system architecture, lead incident response efforts, and serve as an escalation point for critical issues
Drive proactive reliability improvements through monitoring and reporting
Communicate system health and risks to leadership and mentor team members while supporting hiring and onboarding efforts
Ensure compliance with HIPAA and organizational security and regulatory requirements

Requirements:

Participation in a scheduled on-call rotation is required
7+ years of Site Reliability Engineering or Production Engineering experience
Strong experience with Azure cloud infrastructure, Kubernetes, Docker, Java production systems, CI/CD (GitHub Actions), and observability platforms (Dynatrace preferred)
Demonstrated experience automating infrastructure and operational workflows
Deep understanding of SRE principles (SLIs, SLOs, error budgets)
Experience with Ansible
Solid understanding of Linux and Windows system administration
Experience working with onsite and offshore teams
Strong communication skills (written and verbal)
Strong organizational skills and attention to detail
Strong analytical and problem-solving skills
Experience designing automation that replaces or materially reduces on-call toil
Experience building or orchestrating AI agents applied to operational workflows
Familiarity with multi-agent architectures or distributed automation systems
Strong judgment around risk management, safety boundaries, and human-in-the-loop design
Experience working in healthcare or regulated environments
Experience in healthcare software or compliance solutions is a plus

Senior Site Reliability Engineer Largely Remote
