Granicus is a company focused on transforming the Govtech industry through technology that connects governments and their constituents. They are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and performance of their services, while leading efforts in building and maintaining robust infrastructure and automating processes.
Responsibilities:
- On-call Production Support: Provide production support in shifts according to the team's on-call roster
- Work on tickets raised by customers and by the internal engineering and implementation teams when not on call for production support. For example, a client may request a data correction on the database server that cannot be made through the web interface
- Work on SRE backlog items
- Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability
- Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention
- Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence
- System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance
- Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes
- Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team
- Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth
- Security: Implement and adhere to security best practices to protect our systems and data
Requirements:
- 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems
- Experience supporting AI/ML infrastructure, including model deployment, inference optimization, and integration with services like AWS Bedrock
- Expertise in Linux/Unix systems and cloud platforms (AWS, Azure, or Google Cloud)
- Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++)
- Familiarity with AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning
- Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, monitoring, and observability
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks
- Certifications: Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, or similar