Amerit Consulting is a fast-growing staffing and consulting firm that provides services to Fortune 500 companies. The firm is seeking a Senior Site Reliability Engineer to operate, stabilize, and improve highly available systems while driving AI-based reliability automation across services.
Responsibilities:
- Design, operate, and support scalable, observable production systems in Azure
- Participate in and lead on-call rotations and high-severity incident response
- Conduct root cause analysis, blameless post-incident reviews, and implement corrective actions
- Own and enhance observability using Dynatrace (dashboards, alerts, SLIs/SLOs)
- Troubleshoot production issues across Java-based services, Kubernetes, and cloud infrastructure
- Collaborate with cross-functional teams to reduce risk and operational toil
- Design and build AI-driven automation for incident ingestion, triage, investigation, and remediation using multi-agent patterns
- Develop automation for incident communication, reporting, and continuous improvement
- Remain accountable for system reliability and AI-driven operations in production
- Combine software engineering and systems expertise to automate workflows, improve performance, and enhance system resilience
- Develop CI/CD pipelines, manage infrastructure, improve monitoring and observability, and support Java applications in production environments
- Define reliability standards, influence system architecture, lead incident response efforts, and serve as an escalation point for critical issues
- Drive proactive reliability improvements through monitoring and reporting
- Communicate system health and risks to leadership and mentor team members while supporting hiring and onboarding efforts
- Ensure compliance with HIPAA and organizational security and regulatory requirements
Requirements:
- Participation in a scheduled on-call rotation is required
- 7+ years of Site Reliability Engineering or Production Engineering experience
- Strong experience with Azure cloud infrastructure, Kubernetes, Docker, Java production systems, CI/CD (GitHub Actions), and observability platforms (Dynatrace preferred)
- Demonstrated experience automating infrastructure and operational workflows
- Deep understanding of SRE principles (SLIs, SLOs, error budgets)
- Experience with Ansible
- Solid understanding of Linux and Windows system administration
- Experience working with onsite and offshore teams
- Strong communication skills (written and verbal)
- Strong organizational skills and attention to detail
- Strong analytical and problem-solving skills
- Experience designing automation that replaces or materially reduces on-call toil
- Experience building or orchestrating AI agents applied to operational workflows
- Familiarity with multi-agent architectures or distributed automation systems
- Strong judgment around risk management, safety boundaries, and human-in-the-loop design
- Experience working in healthcare or regulated environments
- Experience in healthcare software or compliance solutions is a plus