StratusGrid is building Stratusphere™, a multi-agent platform that turns cloud complexity into measurable outcomes. The Senior Cloud Optimization Engineer will assess, design, and execute cloud optimization work in real customer environments while validating and improving agent-driven workflows. This role involves collaborating with customer stakeholders to deliver measurable savings and operational improvements through effective communication and technical expertise.
Responsibilities:
- Customer Outcomes & Optimization Delivery: Manually assess customer AWS/Azure environments (and eventually GCP), identify optimization opportunities, quantify impact, propose solutions, and execute approved changes safely and efficiently, delivering measurable savings and operational improvements
- Stratusphere Output Review & Calibration: Review cost-savings opportunities, recommendations, and execution plans generated by Stratusphere. Validate assumptions, safety, feasibility, and expected impact; approve or reject proposed work; and provide structured feedback that improves agent accuracy, reliability, and customer readiness over time
- Build Trust & Navigate Stakeholders: Build and nurture strong customer relationships through clear communication, genuine care, and consistent follow-through. Confidently navigate technical discussions with both technical and non-technical stakeholders, overcome objections by bringing options and recommendations, and make approvals easy
- Big-Picture Decision Support: Connect day-to-day optimization work to customer goals and StratusGrid’s strategy. Explain the business and technical implications of decisions (cost, risk, reliability, performance, operational overhead), and guide stakeholders toward win-win outcomes
- Agent Improvement Feedback Loop: Capture patterns in agent errors and blind spots (e.g., missing context, risky sequencing, unclear rollback, incomplete stakeholder info). Propose rubric changes, training examples, and workflow improvements; partner with Product/Engineering to measurably improve quality and safety
- Pull Request–Quality Change Proposals: Produce clear, decision-ready deliverables that quantify business value up front, document risks and mitigations, and demonstrate safety/rollback ability, so customers are making decisions, not providing direction. Where Stratusphere drafts artifacts, you will refine and elevate them to StratusGrid standards
- Execution with Reliability & Urgency: Own work end-to-end with a strong sense of responsibility. Capture commitments, hit deadlines, communicate status proactively, escalate early, and close the loop visibly, with no surprises
- AI-Enabled Delivery: Use AI tools daily to accelerate investigation, documentation, analysis, and implementation planning, raising quality, reducing cycle time, and improving customer outcomes while maintaining sound engineering judgment
- Agent Work Product Evaluation & Feedback: Evaluate, score, and provide actionable feedback on agent-driven outputs (findings, plans, execution steps, customer comms) to continuously improve Stratusphere’s reliability, safety, and usefulness
- Product Partnership & Roadmap Input: Partner with Product to convert customer problems and recurring friction into clear problem statements, capability gaps, and roadmap recommendations, bringing evidence and pattern recognition from the field
- Cross-Functional Collaboration: Work in high-visibility channels with Engineering, Product, and Customer teams; share context broadly; ask questions early; and support a safe culture where we learn fast without introducing risk through isolation
- Operational Excellence & Standards: Follow StratusGrid customer experience standards, change control processes, documentation expectations, and work-system hygiene to ensure consistency, traceability, and scalability
Requirements:
- Proven ability to operate in production AWS and Azure environments, including multi-account/subscription structures, governance constraints, and enterprise-grade patterns. (GCP familiarity is a plus; willingness to ramp is required.)
- Demonstrated experience implementing real savings and operational improvements across:
  - Compute: rightsizing, scheduling, autoscaling, instance family shifts, Spot strategies where appropriate
  - Storage: lifecycle/tiering, orphaned volumes/snapshots, retention optimization
  - Managed Services: tier/sizing optimization, non-prod controls, usage tuning
  - Networking: egress/data transfer analysis, NAT/GW cost drivers, topology-aware recommendations
  - Commitments: Savings Plans/RIs/Reservations strategy, coverage, and utilization improvements
- Strong IaC skills (Terraform preferred; CloudFormation and/or Bicep/ARM valued), with disciplined Git workflows, PR-based delivery, and an instinct for rollback plans, validation steps, and minimizing blast radius
- Proficiency in automation using a modern programming language (Python, TypeScript, Go, etc.); comfort with AWS/Azure CLIs and SDKs to enumerate resources, collect metadata/metrics, and operationalize remediation at scale
- Strong working knowledge of IAM/RBAC and least-privilege operations (AWS: IAM roles, cross-account access, SCPs, permission boundaries; Azure: Microsoft Entra ID, RBAC, PIM, Azure Policy), and the ability to work effectively within constrained access and compliance requirements
- Ability to use metrics/logs (CloudWatch, Azure Monitor/Log Analytics; Prometheus/Grafana a plus) to assess risk, validate performance impact, and confirm outcomes post-change
- Practical understanding of VPC/VNet constructs, routing, DNS, load balancing, private connectivity patterns, and how architecture decisions affect cost, reliability, and security
- Exceptional written and verbal communication, able to translate complex technical topics into decision-ready narratives for technical and business stakeholders, with clarity, completeness, and empathy. Ability to listen to customer concerns and turn them into actionable solutions
- Proven ability to work through ambiguity, navigate formal and informal org dynamics, align stakeholders, and drive work to resolution without offloading effort to customers
- Track record of meeting commitments, proactively communicating status/risks, escalating early, and maintaining high standards of reliability and follow-through
- Demonstrated habit of using AI tools to accelerate analysis and improve quality (while maintaining rigorous verification, security awareness, and sound judgment)
- Willingness and ability to travel periodically (as needed) for customer engagements, team planning sessions, or onsite work
- Equipped to work effectively in a distributed team environment, including a reliable high-speed internet connection, a professional and distraction-limited workspace, and the ability to consistently communicate, collaborate, and execute independently
- Experience rightsizing EKS/AKS/GKE clusters with a focus on compute strategy and cost optimization. Ability to implement horizontal and vertical pod autoscalers (HPA & VPA) without sacrificing system stability
- CI/CD pipeline experience (GitHub Actions, GitLab CI, Azure DevOps) and policy-as-code exposure (OPA/Sentinel/Azure Policy)
- Experience with enterprise landing zones (AWS Control Tower / Azure Landing Zones)
- Experience querying large billing datasets (Athena/BigQuery/ADX/Power BI). Proven ability to translate raw billing data into actionable recommendations to balance business needs with cloud costs
- Deep understanding of how cost-saving measures can affect the security of cloud environments, with proficiency in maintaining least-privilege and data protection standards when recommending and executing cost optimizations
- Strong grasp of FinOps fundamentals and cost drivers; expert use of native cost tooling and reporting to build credible baselines, forecasts, and realized-savings narratives (e.g., AWS Cost Explorer/CUR, Azure Cost Management exports, budgets, alerts)