Autheo is a pioneering company focused on integrating blockchain technology with enterprise solutions. They are seeking a highly skilled L3 Senior Site Reliability Engineer / Cloud Engineer to design, build, and operate reliable cloud infrastructure for blockchain services and Web3 applications.
Responsibilities:
- Architect, deploy, and operate highly available AWS infrastructure optimized for blockchain workloads
- Implement Infrastructure as Code (IaC) using Terraform for repeatable, auditable provisioning
- Manage production container platforms (EKS, ECS, Kubernetes, Docker, ECR)
- Operate and optimize EC2, S3, EBS/FSx, Lambda, and related services
- Design VPCs, VPNs, subnets, security groups, routing, load balancers, and network isolation
- Implement IAM, KMS, and Secrets Manager for identity, encryption, and key management
- Apply scaling techniques for RPC endpoints (load balancing, caching, throttling) and manage public/private peer connectivity
- Support and troubleshoot Amazon Linux, Oracle Linux, and Windows Server environments
- Deploy, operate, and maintain blockchain nodes (full/archive/light clients) and RPC endpoints on EVM-compatible chains (Ethereum, Polygon, BNB Chain, etc.)
- Optimize node performance, storage, networking, and containerization using Docker/Kubernetes
- Monitor and troubleshoot blockchain health metrics (block height, peer count, sync status, logs, memory, throughput)
- Support on-chain/off-chain interactions, transactions, gas fees, signing, wallets, smart contract invocations, and state queries
- Troubleshoot blockchain errors (transaction failures, RPC timeouts, indexing lag, sync divergence)
- Work with API gateways and middleware services (Infura, Alchemy, QuickNode, or equivalents)
- Implement indexing for event logs, state, and transactions using tools like The Graph, ETL pipelines, custom services, or database-backed explorers
- Implement Terraform, Helm, and GitOps workflows for infrastructure lifecycle management
- Enforce resilient, automated, scalable design patterns and collaborate on faster, higher-quality deployments
- Own availability, latency, performance, capacity, and SLOs/SLIs/SLAs, using observability-driven insights
- Lead on-call rotations, incident response for S1/S2 events, post-incident reviews, and preventive initiatives
- Reduce operational toil through automation; build and own CI/CD pipelines (Jenkins, GitHub Actions) covering Terraform validation, Docker builds, and Helm deployments
- Instrument blockchain workloads for metrics, logs, traces, predictive signals, and anomaly detection using Datadog, Prometheus, Grafana, ELK, CloudWatch, OpenTelemetry, Wazuh
- Build automated alerting, anomaly detection, diagnostics, and end-to-end observability strategies
- Implement AIOps for event correlation, anomaly detection, predictive diagnostics, automated remediation, and self-healing (using Amazon SageMaker, Amazon Bedrock, and other AI tools)
- Drive security threat detection/prioritization, capacity planning, forecasting, cost control, and reporting
- Enforce cloud security best practices, vulnerability remediation pipelines, and compliance guardrails (SOC 2, PCI DSS, ISO/IEC 27000 series)
- Manage cryptographic materials, KMS/HSM, and wallet abstractions (HD, custodial/non-custodial, multisig)
Requirements:
- 7+ years in Cloud, SRE, Systems, or DevOps Engineering roles
- 5+ years operating production workloads on AWS
- 3+ years supporting blockchain infrastructure, nodes, Web3 applications, DeFi, etc.
- Strong hands-on experience with AWS services (EC2, EKS, ECS, S3, RDS/Aurora, VPC/VPN, Route53, ALB/NLB, KMS, IAM, Secrets Manager, Lambda, EventBridge, CloudWatch, ECR)
- Production experience with containers & Kubernetes
- Proficiency with IaC (Terraform, Helm, AWS CDK) and automation/scripting (Python, Bash, or Go preferred)
- Working experience with CI/CD (GitHub Actions, Jenkins, Argo, etc.)
- Demonstrated experience with observability systems (Datadog, Prometheus, OpenTelemetry, ELK, CloudWatch, Wazuh)
- Practical exposure to AIOps concepts (event correlation, predictive diagnostics, anomaly detection, automated response)
- Experience supporting 24×7 on-call rotation for production services
- Strong understanding of distributed systems, reliability patterns, and fault tolerance
- Experience participating in major incident response and post-incident reviews
Preferred:
- AWS Certifications (Solutions Architect, DevOps Engineer, SysOps Administrator)
- Deep experience with blockchain, Web3, or decentralized system operations
- Proven SRE methodology experience, including automation, CI/CD, and IaC development
- Experience in compliance-driven environments (SOC 2, PCI DSS, ISO/IEC 27000 series)