System One is seeking a highly skilled Sr. Cloud Engineer to support the daily operations and long-term reliability of their cloud-based infrastructure. This role is critical for ensuring uptime, performing proactive maintenance, troubleshooting issues and implementing fixes across cloud environments.
Responsibilities:
- Deploy applications across multiple environments (dev, staging, prod) and ensure consistency and stability
- Build reusable pipeline templates, jobs and stages for CI/CD consistency across teams
- Collaborate with developers to containerize and deploy applications using ECS and Lambda
- Configure GitLab Runners and manage environment-specific variables and secrets
- Define and deploy readiness and liveness probes for containers running in EKS/ECS
- Write scripts that publish CloudWatch custom metrics and alarms based on application-specific probes
- Monitor deployments and system health using CloudWatch and other tools
- Implement rollback strategies and manage version control during deployments
- Troubleshoot and resolve deployment issues and improve pipeline performance and reliability
- Write and maintain automation using Python, Bash, YAML/JSON, Node.js and Lambda functions
- Perform daily health checks using the AWS CLI or scheduled Lambda scripts, and log and report the results
- Set up monitoring thresholds, dashboards, and metrics for application and infrastructure
- Perform root cause analysis and incident correlation using monitoring and performance analysis tools
- Maintain a central inventory of all licensed software deployed in AWS environments
- Maintain accurate documentation on infrastructure and procedures
- Assess and maintain patches for infrastructure software, including third-party software
- Develop a patch testing schedule and rollout plan that includes rollback and recovery
- Create and manage change records. Participate in PI planning and Agile ceremonies
- Keep cloud environments compliant with security standards and best practices
- Orchestrate failover and restoration of ECS/EKS services, Lambda functions, databases and other infrastructure components
- Test and document regional failover playbooks and recovery runbooks
- Ensure compliance with RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements
- Participate in on-call rotations to support 24/7 production systems and respond to incidents as they arise
Requirements:
- BA/BS in IT, Computer Science or related field (equivalent work experience may be accepted in lieu of the degree)
- 8+ years of IT experience. 5+ years of experience in cloud support, infrastructure maintenance or IT operations
- Experience with Infrastructure as Code (Terraform, CloudFormation)
- Strong proficiency in AWS Lambda (writing, deploying and optimizing)
- Hands-on experience with CI/CD tools (GitHub, GitLab, EKS, Kubernetes, DevOps)
- Scripting skills for automation and maintenance tasks (Bash, Python)
- Cloud certifications (AWS DevOps Engineer, Solutions Architect Associate)
- Strong written and verbal communication skills for technical and non-technical stakeholders
- Excellent analytical and problem-solving skills
- Ability to diagnose performance issues in cloud environments
- Experience writing pre-check and post-check scripts for validating system health
- Familiarity with container orchestration (Docker, ECS, Kubernetes)
- Knowledge of ITIL practices or incident management frameworks