System One is seeking a highly skilled Sr. Cloud Engineer to support the daily operations and long-term reliability of their cloud-based infrastructure. This role is critical for ensuring uptime, performing proactive maintenance, troubleshooting issues and implementing fixes across cloud environments.

Responsibilities:

Deploy applications across multiple environments (dev, staging, prod) and ensure consistency and stability
Build reusable pipeline templates, jobs and stages for CI/CD consistency across teams
Collaborate with developers to containerize and deploy applications using ECS and Lambda
Configure GitLab Runners and manage environment-specific variables and secrets
Define and deploy readiness and liveness probes for containers running in EKS/ECS
Write custom scripts for CloudWatch custom metrics and alarms based on application-specific probes
Monitor deployments and system health using CloudWatch and other tools
Implement rollback strategies and manage version control during deployments
Troubleshoot and resolve deployment issues and improve pipeline performance and reliability
Write automation using Python, Bash, YAML/JSON, Node.js, and Lambda functions
Perform daily health checks using the AWS CLI or scheduled Lambda scripts, and log and report the results
Set up monitoring thresholds, dashboards, and metrics for application and infrastructure
Perform root cause analysis and incident correlation using monitoring and performance analysis tools
Maintain a central inventory of all licensed software deployed in AWS environments
Maintain accurate documentation on infrastructure and procedures
Assess and maintain patches for infrastructure software, including third-party software patches
Develop a patch testing schedule and rollout plan that includes rollback and recovery
Create and manage change records. Participate in PI planning/Agile ceremonies
Keep cloud environments compliant with security standards and best practices
Orchestrate failover and restoration of ECS/EKS services, Lambda functions, databases, and other infrastructure components
Test and document regional failover playbooks and recovery runbooks
Ensure compliance with RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements
Participate in on-call rotations to support 24/7 production systems and respond to incidents as they arise

Requirements:

BA/BS in IT, Computer Science, or a related field (equivalent work experience may be accepted in lieu of the degree)
8+ years of IT experience. 5+ years of experience in cloud support, infrastructure maintenance or IT operations
Experience with Infrastructure as Code (Terraform, CloudFormation)
Strong proficiency in AWS Lambda (writing, deploying, and optimizing)
Hands-on experience with CI/CD tools (GitHub, GitLab, EKS, Kubernetes, DevOps)
Scripting skills for automation and maintenance tasks (Bash, Python)
Cloud certifications (AWS DevOps Engineer, Solutions Architect Associate)
Strong written and verbal communication skills for technical and non-technical stakeholders
Excellent analytical and problem-solving skills
Ability to diagnose performance issues in cloud environments
Experience writing pre-check and post-check scripts for validating system health
Familiarity with container orchestration (Docker, ECS, Kubernetes)
Knowledge of ITIL practices or incident management frameworks
