Amerit Consulting, a fast-growing staffing and consulting firm, is seeking an accomplished Senior Site Reliability Engineer. This role focuses on operating, stabilizing, and improving highly available systems while driving reliability automation across services using agentic AI.
Responsibilities:
- Design, operate, and support scalable, observable production systems in Azure
- Participate in and lead on-call rotations and high-severity incident response
- Conduct root cause analysis and blameless post-incident reviews, and implement corrective actions
- Own and enhance observability using Dynatrace (dashboards, alerts, SLIs/SLOs)
- Troubleshoot production issues across Java-based services, Kubernetes, and cloud infrastructure
- Collaborate with cross-functional teams to reduce risk and operational toil
- Design and build AI-driven automation for incident ingestion, triage, investigation, and remediation using multi-agent patterns
- Develop automation for incident communication, reporting, and continuous improvement
- Remain accountable for system reliability and AI-driven operations in production
- Combine software engineering and systems expertise to automate workflows, improve performance, and enhance system resilience using tools such as Azure, Kubernetes, Docker, GitHub Actions, Dynatrace, Python, Bash, and Ansible
- Develop CI/CD pipelines
- Manage infrastructure
- Improve monitoring and observability
- Support Java applications in production environments
- Define reliability standards
- Influence system architecture
- Lead incident response efforts
- Serve as an escalation point for critical issues
- Drive proactive reliability improvements through monitoring and reporting
- Communicate system health and risks to leadership
- Mentor team members while supporting hiring and onboarding efforts
- Ensure compliance with HIPAA and organizational security and regulatory requirements
Requirements:
- Participation in a scheduled on-call rotation is required
- 7+ years of Site Reliability Engineering or Production Engineering experience
- Strong experience with Azure cloud infrastructure, Kubernetes, Docker, Java production systems, CI/CD (GitHub Actions), and observability platforms (Dynatrace preferred)
- Demonstrated experience automating infrastructure and operational workflows
- Deep understanding of SRE principles (SLIs, SLOs, error budgets)
- Experience with Ansible
- Solid understanding of Linux and Windows system administration
- Experience working with onsite and offshore teams
- Strong communication skills (written and verbal)
- Strong organizational skills and attention to detail
- Strong analytical and problem-solving skills
- Experience designing automation that replaces or materially reduces on-call toil
- Experience building or orchestrating AI agents applied to operational workflows
- Familiarity with multi-agent architectures or distributed automation systems
- Strong judgment around risk management, safety boundaries, and human-in-the-loop design
- Experience working in healthcare or regulated environments
- Experience in healthcare software or compliance solutions is a plus