Cloudera, a leading company in data management and cloud innovation, is seeking a Staff Backend Engineer to join its Anywhere Cloud team. The role involves architecting and improving scalable backend systems, driving performance and reliability across Kubernetes-based services, and mentoring engineers while managing project priorities.
Responsibilities:
- Architect, build, and improve scalable backend systems and APIs
- Drive performance, reliability, and security across Kubernetes-based backend services
- Implement robust testing frameworks, including unit, regression, and end-to-end tests, to guarantee deterministic and predictable behavior from our AI-powered data platform
- Establish safety guardrails and human-in-the-loop processes to maintain accuracy and ensure the production of ethical, responsible, and non-toxic outputs
- Optimize for cost & performance: Instrument, analyze, and optimize unit economics (token usage, caching, batching, distillation) and performance (p95 latency, throughput, autoscaling)
- Drive data excellence: Shape data contracts, feedback loops, labeling strategies, and feature stores to continuously improve model and workflow quality
- Mentor and multiply: Provide technical leadership across teams, unblock complex projects, raise code/design standards, and mentor senior engineers
- Partner across functions: Translate product intent into technical plans, influence roadmaps with data-driven insights, and communicate trade-offs to executives and stakeholders
Requirements:
- 6+ years of software engineering experience building large-scale distributed production systems
- Expertise in at least one primary language and its ecosystem (Go preferred; e.g., Rust), plus cloud-native architectures (containers, service mesh, queues, eventing)
- Proven expertise in advanced Kubernetes design and operation, including optimizing performance (e.g., node affinity, resource limits, horizontal pod autoscaling), service mesh implementation, and custom resource definition (CRD) development
- Experience designing reusable AI workflow primitives, SDKs, or internal platforms used by multiple product teams
- Experience building robust tracing/metrics/logging for AI systems; familiarity with quality dashboards and prompt-diff tooling
- Experience with managing machine learning workloads on container orchestration platforms like Kubernetes, including setting up GPU resources, managing distributed training jobs, and deploying models at scale
- Familiarity with the AI/ML ecosystem: You understand the fundamentals of LLMs, vector databases, RAG, and prompt engineering
- Familiarity with tools such as MLflow, LangChain, or Hugging Face is a significant advantage
- Security & privacy mindset: Familiarity with data governance, PII handling, tenant isolation, and compliance considerations