Cloudera empowers people to transform complex data into actionable insights. The company is seeking a Staff Full Stack Software Engineer to lead the architecture and delivery of AI-powered workflows, collaborating across teams to deliver high-impact features at scale.
Responsibilities:
- Own the architecture: Design, evolve, and document the end-to-end AI workflow stack (prompting, retrieval, tools/function-calling, agents, orchestration, evaluation, observability, and safety) with clear interfaces, SLAs, and versioning
- Ship production systems: Build reliable, low-latency services that integrate foundation models (hosted and self-hosted) and traditional microservices
- Own end-to-end delivery of features, from the user-facing UI to the backend services
- Implement robust testing frameworks (unit, regression, and end-to-end tests) to guarantee deterministic, predictable behavior from our AI-powered data platform, and establish safety guardrails and human-in-the-loop processes to maintain accuracy and ensure ethical, responsible, and non-toxic outputs
- Optimize for cost & performance: Instrument, analyze, and optimize unit economics (token usage, caching, batching, distillation) and performance (p95 latency, throughput, autoscaling)
- Drive data excellence: Shape data contracts, feedback loops, labeling strategies, and feature stores to continuously improve model and workflow quality
- Mentor and multiply: Provide technical leadership across teams, unblock complex projects, raise code/design standards, and mentor senior engineers
- Partner across functions: Translate product intent into technical plans, influence roadmaps with data-driven insights, and communicate trade-offs to executives and stakeholders
Requirements:
- Bachelor's degree in Computer Science or equivalent, and 6+ years of experience
- Expertise in at least one primary language and its ecosystem (e.g., Python, Go, or Java; Rust preferred) and in cloud-native architectures (containers, service mesh, queues, eventing)
- Proven experience in integrating AI/ML models into user interfaces. This is more than just calling an API; you should have experience building features like AI-powered assistants, natural language interfaces (e.g., text-to-SQL), proactive suggestions, or intelligent data visualization
- Familiarity with the AI/ML ecosystem: You understand the fundamentals of LLMs, vector databases, RAG, and prompt engineering. Familiarity with tools such as MLflow, LangChain, or Hugging Face is a significant advantage
- Security & privacy mindset: Familiarity with data governance, PII handling, tenant isolation, and compliance considerations
- Platform thinking: Experience designing reusable AI workflow primitives, SDKs, or internal platforms used by multiple product teams
- Model ops: Experience with model lifecycle management, feature/embedding stores, prompt/version management, and offline/online eval systems
- Search & data infra: Experience with vector databases (e.g., Pinecone, Weaviate, pgvector), retrieval strategies, and indexing pipelines
- Observability: Experience building robust tracing/metrics/logging for AI systems; familiarity with quality dashboards and prompt diff tooling
- Cost strategy: Experience with model selection, distillation, caching layers, router policies, and autoscaling to manage spend
- Experience managing machine learning workloads on container orchestration platforms such as Kubernetes, including provisioning GPU resources, managing distributed training jobs, and deploying models at scale