Cloudbeds is a company transforming hospitality through its software platform, serving properties worldwide. The Senior Site Reliability Engineer will ensure the reliability and performance of the platform, architect scalable AWS solutions, and foster a culture of automation and continuous improvement within the engineering teams.
Responsibilities:
- Design and implement reliable and scalable AWS architecture to meet the needs of the organization
- Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components
- Support the CI/CD process with ArgoCD and GitOps
- Automate the platform deployments with Terraform infrastructure-as-code
- Develop and continuously improve product Observability and Monitoring systems based on Grafana, Prometheus, Datadog, and CloudWatch
- Respond to incidents and participate in Incident Management and Root Cause Analysis, ensuring minimal impact on services
- Optimize system performance and troubleshoot issues as they arise
- Collaborate with development teams to establish monitoring best practices and ensure systems meet reliability targets
- Collaborate with security teams to implement and maintain security best practices
- Participate in the infrastructure support rotation, providing guidance to other engineering teams
Requirements:
- 5+ years of experience as a DevOps engineer or SRE working within the AWS ecosystem
- 5+ years of experience with Kubernetes (EKS) and Helm charts
- Experience designing, building, and supporting CI/CD pipelines with ArgoCD and GitHub Actions
- Experience with infrastructure-as-code methodologies with Terraform
- Experience with Observability and Monitoring with Grafana, Prometheus, Datadog, and CloudWatch
- Experience with Incident Management, full-stack troubleshooting, performance analysis, and root cause analysis (RCA)
- Experience with Web application systems such as Nginx, Ingress controllers, load balancing and Content Delivery Networks
- Experience with Databases (MySQL, PostgreSQL, Aurora) and Middleware technologies (Redis, Memcached and SQS)
- Good networking skills with VPC, Security Groups and Network ACLs
- Ability to work remotely and manage your own time in a global team
- Good written and verbal communication in English
- Bachelor's degree in Computer Science or equivalent experience
- Advanced experience with Database Administration (Aurora, MySQL, PostgreSQL)
- Experience working in a PCI-compliant environment
- Experience working with Kong API Gateway