Cayuse is a company focused on empowering organizations to conduct globally connected research through advanced technology and exceptional service. As a Senior Infrastructure Engineer, you will drive the reliability, scalability, and efficiency of its cloud-based infrastructure and SaaS products, while mentoring colleagues and helping to improve SRE practices.
Responsibilities:
- Serve as a technical expert and mentor to other engineers, sharing knowledge and best practices
- Lead by example, demonstrating strong technical proficiency in SRE principles and practices, specifically within the AWS ecosystem
- Contribute to the development and implementation of SRE standards and guidelines, tailored to AWS best practices
- Foster a culture of continuous learning and improvement within the team
- Help others grow their automation skills
- Design, build, and maintain robust and scalable infrastructure using Terraform, leveraging AWS services effectively
- Develop and optimize CI/CD pipelines using Bitbucket Pipelines, integrating seamlessly with AWS deployment strategies
- Implement and maintain monitoring and logging solutions to ensure system observability, utilizing AWS monitoring tools
- Automate infrastructure and operational tasks to reduce toil and improve efficiency, with a focus on AWS automation
- Contribute to the development and maintenance of automation tools and scripts
- Troubleshoot complex infrastructure and application issues within the AWS environment
- Respond to on-call Sev 1 incidents, particularly those occurring during Australian (AU) time zone hours, and participate in a 24/7 on-call rotation approximately once per month
- Participate in incident response and root cause analysis, contributing to the resolution of critical issues on AWS
- Define and monitor SLOs/SLAs to ensure system reliability, using AWS metrics and monitoring
- Contribute to disaster recovery planning and testing, utilizing AWS disaster recovery capabilities
- Analyze system performance and identify areas for improvement within AWS
- Proactively find and resolve potential issues before they become incidents
- Collaborate with development, operations, and other teams to ensure smooth and efficient operations on AWS
- Contribute to code reviews and technical discussions
- Identify and implement process improvements to enhance team efficiency and effectiveness
- Document best practices and create knowledge-sharing resources
- Participate in agile ceremonies
Requirements:
- Deep experience with AWS, including core services like EC2, S3, RDS, Lambda, CloudWatch, EKS, and a solid understanding of AWS networking (VPC, Security Groups) and security fundamentals (IAM)
- 4+ years of experience working with public cloud technologies (AWS preferred)
- 4+ years of experience developing monitoring and log analysis tools, including proficiency with Grafana and New Relic
- Deep understanding of Site Reliability Engineering (SRE) principles, platforms, and tools
- Proven experience with Terraform and Bitbucket Pipelines
- Strong understanding of CI/CD pipelines and SDLC
- Experience with Docker and Kubernetes
- Proficiency in scripting languages (bash, Python)
- Experience implementing and managing security controls and tools
- Understanding of security systems and best practices
- Experience with git and code branching/merging strategies
- Experience with Agile methodologies (Scrum, Kanban)
- Strong problem-solving and troubleshooting skills
- Excellent communication and collaboration skills
- Passion for mentoring and sharing knowledge
- Automation-first mindset
- Ability to own medium to large technical projects
- Candidates MUST reside in Australia to be considered