INSPYR Solutions is seeking a highly skilled Sr. Cloud Engineer to support the daily operations and long-term reliability of cloud-based infrastructure. This role is critical for ensuring uptime, performing proactive maintenance, troubleshooting issues, and implementing fixes across cloud environments.
Responsibilities:
- Deploy applications across multiple environments (development, staging, production) and ensure consistency and stability
- Build reusable pipeline templates, jobs, and stages for CI/CD consistency across teams
- Collaborate with developers to containerize and deploy applications using ECS and Lambda
- Configure GitLab Runners and manage environment-specific variables and secrets
- Define and deploy container health checks: readiness and liveness probes in EKS, and container health checks in ECS
- Write custom scripts for CloudWatch metrics and alarms based on application-specific probes
- Monitor deployments and system health using CloudWatch and other tools
- Implement rollback strategies and manage version control during deployments
- Troubleshoot and resolve deployment issues; improve pipeline performance and reliability
- Perform daily health checks using AWS CLI or scheduled Lambda scripts; log and report results
- Set up monitoring thresholds, dashboards, and metrics for applications and infrastructure
- Perform root cause analysis and incident correlation using monitoring and performance tools
- Maintain a central inventory of all licensed software deployed in AWS environments
- Maintain accurate documentation on infrastructure and procedures
- Assess and patch infrastructure software, including third-party software
- Develop a patch testing schedule and rollout plan with rollback and recovery procedures
- Create and manage change records; participate in PI planning and Agile ceremonies
- Keep cloud environments compliant with security standards and best practices
- Orchestrate failover and restoration of ECS/EKS services, Lambda functions, databases, and other components
- Test and document regional failover playbooks and recovery runbooks
- Ensure compliance with RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements
- Participate in on-call rotations to support 24/7 production systems and respond to incidents
Requirements:
- BA/BS in IT, Computer Science, or related field (or equivalent work experience)
- 8+ years of IT experience, including 5+ years in cloud support, infrastructure maintenance, or IT operations
- Experience with Infrastructure as Code (Terraform, CloudFormation)
- Strong proficiency with AWS Lambda (writing, deploying, optimizing)
- Hands-on experience with CI/CD and DevOps tooling (GitHub, GitLab) and with deployment platforms (EKS, Kubernetes)
- Scripting skills for automation and maintenance tasks (Bash, Python)
- Cloud certifications (AWS Certified DevOps Engineer, AWS Certified Solutions Architect Associate)
- Strong written and verbal communication skills for technical and non-technical stakeholders
- Excellent analytical and problem-solving skills
- U.S. Citizen; able to obtain and maintain a Public Trust clearance
- Ability to diagnose performance issues in cloud environments
- Experience with pre-check and post-check scripts for system health validation
- Familiarity with container orchestration (Docker, ECS, Kubernetes)
- Knowledge of ITIL practices or incident management frameworks