MLabs, a company operating at the intersection of decentralized finance and artificial intelligence, is seeking a Senior DevOps / SRE Engineer. The role involves managing infrastructure for autonomous AI trading agents and ensuring reliable, zero-downtime operations in high-stakes financial environments.
Responsibilities:
- Agent Infrastructure Management: Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes
- Deployment & Orchestration: Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity
- CI/CD Pipeline Development: Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity
- Zero-Downtime Operations: Execute deployment strategies (blue/green, canary) that keep active financial positions protected during every infrastructure change
- Observability & Monitoring: Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs
- Cloud & Database Scaling: Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka
- Blockchain Reliability: Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems
- Incident Leadership: Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability
Requirements:
- Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up
- Proven track record of deploying, scaling, and debugging production workloads, specifically within AWS EKS
- Proficiency with infrastructure-as-code and configuration management tools such as Terraform, Ansible, or equivalent frameworks
- Hands-on experience with Docker and Helm for packaging production services
- Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka)
- Strong experience building proactive operational visibility with Prometheus, Grafana, Datadog, Loki, or OpenTelemetry
- Ability to debug across multiple languages, including Python, Node.js, and Go
- Understanding of systems where latency and reliability have direct financial consequences
- Familiarity with blockchain node infrastructure, exchange APIs, wallet operations, and on-chain monitoring
- Experience managing secrets, access controls, and production hardening for sensitive financial environments
- Experience defining SLOs and building mature on-call practices
- Experience with OpenClaw agent deployments and workspace templates
- Familiarity with Model Context Protocol (MCP) server deployment and auth management
- Direct experience with Hyperliquid or other decentralized exchange (DEX) protocols
- Background in fintech, market data infrastructure, or high-frequency trading systems