DataEdge Consulting is seeking a Senior Site Reliability Engineering Architect for a global Fortune 500 company in the Food Services industry. The role focuses on designing automation-first, AI-augmented reliability platforms for large-scale cloud environments while ensuring systems can operate with minimal human intervention.
Responsibilities:
- Design reliability architectures that prioritize automation and intelligent decision-making over manual processes
- Define patterns for fault isolation, graceful degradation, and recovery that assume automated and AI-assisted execution
- Ensure reliability, security, and governance requirements are embedded directly into operational systems and workflows
- Establish architectural standards that reduce complexity, human dependency, and operational risk
- Architect event-driven automation platforms that span detection, decisioning, and execution
- Design and implement workflow orchestration systems capable of handling both low-risk autonomous actions and higher-risk human-approved operations
- Replace ticket-driven and static runbook processes with executable, testable automation
- Standardize automation patterns across incident response, change execution, and platform operations
- Ensure automation systems are resilient, observable, and auditable
- Design and own internal AI-driven operational platforms that act as a centralized interface for reliability and automation workflows
- Build systems that allow intelligent components to retrieve operational context, reason over signals, and invoke controlled actions across infrastructure and services
- Establish architectures for agent coordination, capability discovery, and safe execution in production environments
- Define guardrails, approval paths, observability, and auditability for AI-initiated actions
- Integrate AI-driven decisioning directly into operational workflows rather than treating it as an external enhancement
- Architect observability systems that feed automation and intelligent decision-making rather than only populating static dashboards
- Design signal pipelines that correlate metrics, logs, traces, and events into actionable context
- Reduce alert fatigue through context-aware, noise-resistant detection and prioritization
- Ensure every operational signal has a defined automated or AI-assisted response path
- Drive continuous improvement through trend analysis and systemic remediation
- Define governed usage standards for enterprise low-code automation platforms to accelerate operational workflows
- Enable secure, scalable automation for approvals, communications, enrichment, and orchestration while preventing platform sprawl
- Establish clear boundaries between low-code automation and code-first systems
- Integrate enterprise automation tools with cloud-native automation and AI-driven operational platforms
- Serve as the architectural authority for reliability, automation, and AI-driven operations
- Mentor senior engineers and raise organizational maturity in automation and intelligent systems
- Partner with engineering, security, and compliance teams to deliver safe, scalable operational platforms
- Own reference architectures, operational standards, and long-term technical direction
- Challenge designs that increase operational risk, toil, or manual dependency
Requirements:
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering supporting complex distributed systems
- Proven experience designing and operating automation-heavy or autonomous operational platforms
- Strong programming and automation skills using modern languages and frameworks
- Hands-on experience with workflow orchestration and event-driven systems
- Practical experience integrating AI or intelligent decision systems into production operations
- Deep understanding of failure modes, blast radius management, and risk-aware automation
- Experience designing or implementing agent-based or AI-assisted operational systems
- Familiarity with modern AI platforms and model integration for operational use cases
- Experience with control-plane architectures for automation and intelligent systems
- Experience with enterprise automation and governance
- Knowledge of cost-aware reliability design, FinOps principles, and zero-trust security models
- Relevant cloud or platform certifications