INSPYR Solutions is seeking a highly skilled Sr. Cloud Engineer to support the daily operations and long-term reliability of cloud-based infrastructure. This role is critical for ensuring uptime, performing proactive maintenance, troubleshooting issues, and implementing fixes across cloud environments.

Responsibilities:

Deploy applications across multiple environments (development, staging, production) and ensure consistency and stability
Build reusable pipeline templates, jobs, and stages for CI/CD consistency across teams
Collaborate with developers to containerize and deploy applications using ECS and Lambda
Configure GitLab Runners and manage environment-specific variables and secrets
Define and deploy readiness and liveness probes for containers running in EKS, and equivalent container health checks for ECS
Write custom scripts for CloudWatch metrics and alarms based on application-specific probes
Monitor deployments and system health using CloudWatch and other tools
Implement rollback strategies and manage version control during deployments
Troubleshoot and resolve deployment issues; improve pipeline performance and reliability
Perform daily health checks using AWS CLI or scheduled Lambda scripts; log and report results
Set up monitoring thresholds, dashboards, and metrics for applications and infrastructure
Perform root cause analysis and incident correlation using monitoring and performance tools
Maintain a central inventory of all licensed software deployed in AWS environments
Maintain accurate documentation on infrastructure and procedures
Assess and patch infrastructure software, including third-party software
Develop a patch testing schedule and rollout plan with rollback and recovery procedures
Create and manage change records; participate in PI planning and Agile ceremonies
Keep cloud environments compliant with security standards and best practices
Orchestrate failover and restoration of ECS/EKS services, Lambda functions, databases, and other components
Test and document regional failover playbooks and recovery runbooks
Ensure compliance with RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements
Participate in on-call rotations to support 24/7 production systems and respond to incidents

Requirements:

BA/BS in IT, Computer Science, or related field (or equivalent work experience)
8+ years of IT experience, including 5+ years in cloud support, infrastructure maintenance, or IT operations
Experience with Infrastructure as Code (Terraform, CloudFormation)
Strong proficiency with AWS Lambda (writing, deploying, optimizing)
Hands-on experience with CI/CD and DevOps tooling (GitHub, GitLab, EKS, Kubernetes)
Scripting skills for automation and maintenance tasks (Bash, Python)
Cloud certifications (AWS Certified DevOps Engineer - Professional, AWS Certified Solutions Architect - Associate)
Strong written and verbal communication skills for technical and non-technical stakeholders
Excellent analytical and problem-solving skills
U.S. Citizen; able to obtain and maintain a Public Trust clearance
Ability to diagnose performance issues in cloud environments
Experience with pre-check and post-check scripts for system health validation
Familiarity with container orchestration (Docker, ECS, Kubernetes)
Knowledge of ITIL practices or incident management frameworks

Sr. AWS Cloud DevOps Engineer