Appspace is a company dedicated to creating better work experiences for people everywhere. They are seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and scalability of their SaaS applications by designing, implementing, and maintaining robust systems and processes.
Responsibilities:
- Executing projects that roll out new platform maintenance features, automate tasks, or make other big-picture changes to improve our customers’ experience on our Cloud Platform
- Deploying new features and releases of our software into Kubernetes via Helm (strong experience with both Kubernetes and Helm is a must)
- Troubleshooting performance issues or errors thrown by the cloud platform or application, and either resolving the underlying cause, or forwarding your research to Engineering to address in the product
- Mentoring others towards technical and procedural success and providing daily operational support to our DevOps team members
- Actioning Request Tickets from other teams in support of their needs to enable and prepare for upcoming releases
- Monitoring and maintaining our Platform’s uptime, resiliency, and performance, looking for improvement opportunities, and proactively taking action to resolve negative trends before they become issues
- Leading, participating in, or executing the incident management process when alerts fire, quickly ascertaining the root cause, resolving the issue, and finding new and creative solutions to prevent recurrence
- Configuring, monitoring, researching, and evaluating workload performance on both Google Cloud Platform and Microsoft Azure
- Working closely with security teams to ensure adherence to security best practices and compliance standards
- Collaborating with our Development and Quality Assurance teams to address issues in the product and platform, particularly around recurring problems
- Documenting new processes and procedures, or updating existing ones, to share knowledge and improve on standardized approaches to solutions
Requirements:
- Must have a passion for life-long learning
- Must communicate well and adapt to working well with others across different countries and cultures
- A strong background in containers, Kubernetes, Helm, Linux, and Python coding, plus some experience with Windows Server and macOS, is a must
- Experience with Google Cloud Platform and Microsoft Azure required
- Expert-level troubleshooting experience and the ability to reason through a process workflow to identify a fault or odd behavior (e.g., spending time following log trails)
- Must be flexible on occasionally attending 'off-hour' meetings (we're a global team supporting a global customer base!)
- No travel required for this role
- Experience with administering MySQL & MongoDB preferred
- Experience with administering message brokering systems like RabbitMQ preferred
- Experience with build pipeline tools and the Atlassian suite (Jira, Confluence, Bitbucket/Git, Azure DevOps, Bamboo, Octopus)
- Experience with monitoring and alerting platforms, especially Stackdriver
- Experience with HashiCorp Terraform
- Experience with IIS