About this role
Role Overview
Design, implement, and support fault-tolerant, highly available architectures across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), including redundancy, load balancing, and automated failover strategies.
Deploy, manage, and optimize cloud infrastructure using infrastructure-as-code (IaC) tools such as Terraform and Ansible.
Implement and maintain monitoring, alerting, and logging solutions using tools such as Splunk, Azure Monitor, Dynatrace, AWS CloudWatch, or similar to detect and resolve issues proactively.
Lead incident response activities, including real-time troubleshooting, root-cause analysis, post-incident reviews, and continuous improvement actions to increase uptime and resilience.
Perform capacity planning and performance engineering by forecasting demand, tuning systems, and implementing autoscaling and performance best practices.
Develop and maintain automation scripts and internal tools using Python, PowerShell, Bash, or similar languages to reduce manual intervention and operational toil.
Collaborate with security teams to implement secure infrastructure practices including encryption, role-based access control, auditing, and vulnerability management.
Work closely with engineering and DevOps teams to promote reliability best practices and contribute to a collaborative, blameless culture that improves consistency and quality of operations.
Requirements
5+ years of experience in cloud site reliability engineering, DevOps engineering, or systems engineering supporting large-scale, distributed systems in public cloud environments.
Experience with at least one major public cloud platform such as AWS, Azure, or GCP, including virtual private clouds (VPCs), identity and access management (IAM), serverless components, and managed Kubernetes services.
Experience with containers and orchestration technologies such as Docker and Kubernetes in production environments.
Experience with infrastructure-as-code tools such as Terraform and Ansible to provision, configure, and manage cloud infrastructure.
Experience implementing monitoring, logging, and observability solutions using tools such as Splunk, Azure Monitor, Dynatrace, AWS CloudWatch, or similar.
Experience administering Linux or Unix and Windows operating systems, with a solid grasp of networking fundamentals.
Experience with programming or scripting languages such as Python, PowerShell, Bash, or similar to automate system management and operational tasks.
Bachelor's degree or higher in Computer Science, Engineering, Information Technology, or a related field, or an equivalent combination of education, related experience, and/or military experience.