Ascendion is a full-service digital engineering company that specializes in software platforms and products. The Lead Cloud Engineer will design, implement, and manage observability solutions, collaborating with application, platform, development, and operations teams to ensure the reliability and performance of critical systems in multi-cloud environments.
Responsibilities:
- Design and Implement Monitoring Solutions: Architect and deploy scalable monitoring and observability frameworks using Datadog, Grafana, and Prometheus across AWS, Azure, or GCP environments
- Build Dashboards & Alerts: Develop comprehensive dashboards, alerts, and performance indicators to enable proactive monitoring of applications, microservices, APIs, and infrastructure
- Integrate Observability Tools: Integrate Datadog and Grafana with CI/CD pipelines, logging (ELK/CloudWatch), APM, and tracing systems for unified observability
- Define KPIs and SLOs: Work with application and platform teams to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
- Automation & Infrastructure as Code (IaC): Automate provisioning, configuration, and management of monitoring tools using Terraform, CloudFormation, or Ansible
- Performance Analysis & Incident Management: Analyze performance bottlenecks, conduct root cause analysis, and drive post-incident reviews with actionable insights
- Governance & Best Practices: Establish observability standards, documentation, and best practices across development and operations teams
- Mentoring & Leadership: Lead a small team of cloud engineers, providing technical guidance, code reviews, and skill development support
Requirements:
- 8+ years of experience in Cloud Engineering, DevOps, or Site Reliability Engineering roles
- Proven expertise with Datadog (metrics, logs, APM, RUM, synthetic monitoring, custom dashboards, alerting)
- Strong hands-on experience with Grafana, Prometheus, and OpenTelemetry
- Solid understanding of cloud platforms (AWS, Azure, or GCP); certification preferred
- Proficiency in Infrastructure as Code (Terraform/CloudFormation) and automation scripting (Python, Bash, PowerShell)
- Experience implementing observability pipelines, log forwarding, and distributed tracing
- Familiarity with Kubernetes, Docker, and microservices architecture
- Strong problem-solving, analytical, and communication skills
- Experience implementing monitoring for Kubernetes clusters and serverless workloads
- Knowledge of incident response automation and AIOps capabilities in Datadog
- Exposure to security monitoring and compliance dashboards
- AWS Certified DevOps Engineer, Azure DevOps Expert, or Datadog Certified Professional preferred