CodeGeniusRecruit is seeking a Backend Engineer to design, build, and optimise distributed infrastructure for AI agents. The role involves developing core backend systems, collaborating with research and applied AI teams, and ensuring system performance and reliability.
Responsibilities:
- Design, build, and optimise distributed infrastructure for training, deploying, and scaling AI agents across high-performance compute environments
- Develop core backend systems (services, APIs, and orchestration layers) that support agent lifecycles, tool execution, memory access, and multi-agent coordination
- Collaborate closely with research and applied AI teams to integrate model-serving pipelines, agent reasoning loops, memory stores, and planning components into production systems
- Build and maintain agent runtime infrastructure, including task scheduling, state management, inter-agent communication, and execution reliability
- Implement monitoring, observability, and fault-tolerance mechanisms for long-running agent processes and distributed workflows
- Evaluate and improve system performance across compute, networking, storage, and inference layers, identifying and resolving bottlenecks
- Participate in synchronous collaboration sessions (4-hour windows, 2–3 times per week) to review architecture decisions, troubleshoot distributed systems, and iterate on design improvements
Requirements:
- Strong foundation in Computer Science, Software Engineering, or Systems Design, with experience building large-scale distributed systems
- Proficiency in one or more backend or systems programming languages such as Go, Rust, Python, C++, Java, Scala, C#, Kotlin, or TypeScript/JavaScript
- Experience with cloud infrastructure (AWS, GCP, or Azure) and containerisation/orchestration tools such as Docker and Kubernetes
- Strong experience designing production-grade backend services, APIs, and distributed systems
- Knowledge of networking, data streaming, caching, and performance optimisation in distributed systems
- Excellent collaboration and communication skills
- Ability to commit 30–40 hours per week, including the required synchronous collaboration sessions
- Familiarity with LLM inference pipelines, agent frameworks, multi-agent architectures, or reinforcement learning environments is a strong plus