Contribute to the operation, maintenance, and continuous improvement of Accela's production cloud environments.
Support platform modernization initiatives, including containerization, cloud-native technologies, and automation efforts.
Monitor platform health, availability, performance, and capacity using modern observability and monitoring tools.
Participate in incident response by troubleshooting production issues and contributing to Root Cause Analysis.
Develop and maintain automation, tooling, and scripts that improve reliability, scalability, deployment efficiency, and operational effectiveness.
Support the implementation and monitoring of service-level objectives (SLOs), service-level agreements (SLAs), and operational metrics.
Partner with Development, DevOps, Database Engineering, and Security teams to identify and resolve reliability, performance, and scalability challenges.
Assist with platform deployments, operational readiness reviews, and change management activities.
Contribute to observability initiatives through monitoring, logging, metrics collection, and distributed tracing.
Support operational compliance activities for SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI DSS environments.
Participate in post-incident reviews and contribute to corrective and preventive actions that improve platform stability.
Requirements
4+ years of experience in Site Reliability Engineering, Cloud Operations, Systems Engineering, DevOps, Software Engineering, or a related technical discipline.
Experience supporting cloud-based SaaS environments, preferably within Microsoft Azure.
Experience with Kubernetes and containerized application environments.
Working knowledge of scripting and automation using Python, PowerShell, Bash, or similar languages.
Experience troubleshooting distributed systems across application, infrastructure, networking, and operating system layers.
Familiarity with monitoring, logging, metrics, and observability platforms.
Strong analytical and problem-solving skills with a structured approach to troubleshooting and Root Cause Analysis.
Experience working within Incident, Problem, and Change Management processes.
Strong written and verbal communication skills and the ability to work effectively with cross-functional teams.