Career Renew is recruiting for a Senior DevOps / SRE Engineer - Trading for one of its clients. The role involves owning and maintaining the infrastructure that supports autonomous AI trading agents, ensuring reliability, security, and performance at scale.
Responsibilities:
- Build and maintain the infrastructure that runs dozens of concurrent AI trading agents per user, each with its own cron schedules, state files, and trailing stop processes
- Deploy and manage OpenClaw agent environments, including workspace persistence, cron orchestration, isolated session management, and MCP server connectivity
- Design and operate CI/CD pipelines for shipping trading skills, plugins, and agent updates to production without interrupting live trading
- Define and execute deployment strategies for production systems, including zero-downtime rollouts, safe rollback mechanisms, and release reliability for live trading workloads
- Keep active positions protected through every infrastructure change, with no deployment-induced downtime
- Build monitoring, alerting, and observability across the full stack using metrics, logs, traces, and dashboards that catch agent failures, orphaned positions, state file corruption, infrastructure regressions, and MCP auth expiration before they cost money
- Manage cloud infrastructure across multiple environments with infrastructure-as-code
- Operate and scale core platform infrastructure including Kubernetes/EKS clusters, containerized workloads, Redis, Postgres/RDS, ClickHouse, Kafka, and blockchain-adjacent services
- Operate blockchain node infrastructure and ensure reliable connectivity to Hyperliquid APIs, on-chain transaction systems, and wallet operations
- Own logging, observability, security, and incident response across the full agent stack
- Lead incident response and on-call practices across the platform, including debugging, mitigation, postmortems, and long-term reliability improvements
- Own backup, recovery, and disaster-readiness for critical infrastructure and trading-supporting systems
Requirements:
- Professional DevOps, SRE, or infrastructure engineering experience, ideally in a startup where you built systems from scratch rather than only maintaining existing ones
- Strong Kubernetes experience — deploying, scaling, and debugging production workloads, ideally on AWS EKS
- Hands-on experience with Docker and Helm for packaging and operating production services
- Proficiency with infrastructure-as-code tools such as Terraform, Ansible, or equivalent
- Experience with CI/CD and deployment automation using GitHub Actions, ArgoCD, or similar systems
- Strong AWS infrastructure experience; multi-cloud experience is a plus
- Experience operating production data and messaging systems such as Redis, Postgres/RDS, ClickHouse, and Kafka
- Strong observability experience with Prometheus, Grafana, Datadog, Loki, ELK/OpenSearch/Kibana, OpenTelemetry, or equivalent tooling
- Ability to build dashboards, alerts, and operational visibility that surface problems before they escalate
- Ability to debug across languages such as Python, Node.js, and Go — you'll be tracing issues through agent scripts, MCP servers, platform services, and infrastructure
- Experience owning security-related infrastructure concerns such as access management, secrets handling, production hardening, and operational controls
- Experience with incident management, on-call operations, and backup/recovery planning for production systems
- Understanding of real-time systems where latency and reliability directly impact financial outcomes — cron jobs that must fire on schedule, state files that cannot corrupt, and atomic operations under concurrent load
- Experience designing deployment strategies for systems that cannot tolerate interruption during live financial activity
- Familiarity with blockchain or node infrastructure, exchange APIs, wallet operations, and on-chain monitoring
- Experience with, or willingness to learn, MCP (Model Context Protocol) server deployment, auth management, and the agent-to-tool connectivity layer
- Hyperliquid experience is a plus, but not required
- Experience with OpenClaw, including agent deployments, workspace templates, cron systems, environment management, and session orchestration
- Experience with multi-agent systems — orchestrating many independent processes that share infrastructure but operate autonomously
- Background in trading systems, market data infrastructure, blockchain infrastructure, or fintech DevOps where uptime has direct financial consequences
- Experience defining SLOs, improving operational maturity, and building reliable on-call practices in fast-moving production environments