Granicus is a company transforming the GovTech industry by building, implementing, and maintaining technology that connects governments and their constituents. They are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and performance of their services, lead infrastructure efforts, and implement best practices in site reliability.
Responsibilities:
- Provide production support on a shift according to the team on-call roster
- When not on call for production support, work on SRE projects and on tickets escalated by Tech Support or raised by internal engineering and implementation teams
- Work on SRE backlog items
- Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability
- Proactively monitors the overall uptime and availability of critical services
- Effectively identifies & addresses monitoring and observability gaps
- Implements effective alerting & notifications, minimizing false alerts
- Creates and manages effective SRE dashboards to report key business metrics, SLAs, SLOs, SLIs & error budgets
- Ensure SREs are meeting or improving on established SLOs
- Proactively evaluates capacity needs to handle growth in scale & traffic load
- Contributes to innovative solutions such as an AI assistant for proactive issue detection & response
- Actively participates in and tracks the execution of SRE projects aimed at improving system reliability
- Effectively collaborates across teams to prevent reliability issues
- Reviews change management tickets to identify and mitigate potential risks to system reliability
- Ensure active participation in change activities and verify that accurate validations are performed by SRE & Engineering teams post-implementation
- Participate in architecture reviews & assess the impact of architectural decisions on system reliability
- Initiates chaos experiments to continuously learn and improve the performance & stability of our systems
- Contributes to innovative solutions that enhance system reliability & scalability
- Actively participate in troubleshooting and resolving incidents, performing root cause analysis and incident postmortems, and implementing long-term fixes to prevent recurrence
- Acknowledges incidents promptly & recovers from them quickly
- Maintains the quality of root cause analysis (RCA) and corrective action plans
- Proactively monitors, measures & adheres to optimal MTTR & MTTA requirements
- Improves the quality of SOPs and adapts AI tools to reduce MTTR
- Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention
- Partner closely with DevOps and Software Engineering teams to enhance system reliability. Provide constructive feedback on design and architecture, actively support and monitor change and release processes, participate in risk assessments, PI planning, change reviews, and Go/No-Go decision calls. Actively present monitoring and observability status both pre- and post-release to all stakeholders involved in the release or change process
- Create and maintain documentation for technology, architecture, processes, procedures, and troubleshooting guides, and share knowledge within the team
- Ensures completeness & accuracy of information
- Contributes to innovative solutions to build an AI-based knowledge base
- Implement and adhere to security best practices to protect our systems and data
Requirements:
- Expertise in Monitoring/Observability - Elastic & CloudWatch/Azure Monitor
- Expertise in Linux/Windows OS & networking
- Advanced knowledge of Cloud services (AWS & Azure)
- Advanced knowledge of Container Technologies - Docker & Kubernetes (K8s)
- Proficiency in Databases/Queries - MSSQL, Postgres, MongoDB, MySQL
- Proficiency in Scripting - Python/PowerShell/Bash
- Working experience with CI/CD Tools - GitLab/Azure DevOps or similar tools
- Working experience with IaC Tools - Terraform/Ansible
- Working experience with Configuration management - Chef
- Working experience with Incident response - PagerDuty, Jira
- Working experience with AI Tools - Copilot, VS Code AI agents, or similar
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience
- At least 8 years of relevant experience in site reliability engineering with a proven track record of managing complex, medium- to large-scale high-availability systems
- Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently
- Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders
- Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives
- Relevant certifications such as Elastic Certified Observability Engineer, AWS Certified Solutions Architect, or Certified Kubernetes Administrator, or equivalent hands-on experience, are highly valued