Oracle, a leading company in AI and cloud solutions, is seeking a Senior Principal AI Agent / ML Software Engineer to define, build, and operate next-generation AI systems on Oracle Cloud Infrastructure (OCI). The role sets technical direction for AI platforms, leads multi-team execution, and ensures the reliability and scalability of AI systems in business-critical environments.
Responsibilities:
- Serve as a senior technical owner for OCI AI platform capabilities, including agent execution, inference systems, model serving, AI workflow orchestration, evaluation, and observability
- Design, architect, and deliver scalable agentic AI systems capable of reasoning, planning, tool use, workflow execution, multi-step task orchestration, and safe human-in-the-loop escalation
- Build production-grade services for tool calling, agent memory, context management, Model Context Protocol (MCP) integration, vector retrieval, multi-agent coordination, policy enforcement, and evaluation
- Lead architecture across distributed services optimized for low latency, high throughput, GPU efficiency, reliability, cost, operability, and secure multi-tenant operation
- Define service boundaries, APIs, data models, state management, consistency tradeoffs, failure modes, SLIs/SLOs, rollout strategies, and operational readiness criteria for AI platform services
- Drive technical strategy across infrastructure, platform, security, data, and application engineering teams, converting broad goals into executable multi-quarter plans and measurable milestones
- Integrate AI agents securely and reliably with enterprise APIs, cloud services, databases, identity systems, secrets management, and external systems
- Establish AgentOps and LLMOps practices for tracing, monitoring, eval suites, regression testing, experimentation, safety guardrails, prompt/tool versioning, and production reliability
- Evaluate and operationalize emerging technologies in generative AI, agentic workflows, inference optimization, long-context systems, reasoning models, AI developer tooling, and agentic-first development
- Drive engineering excellence through code reviews, design reviews, test strategy, deployment automation, incident analysis, documentation, and AI-assisted development practices using tools such as Codex, Claude Code, Cursor, Copilot, or similar systems
- Mentor Staff and senior engineers, raise architectural standards, and influence engineering practices across OCI without requiring direct management authority
- Own critical production outcomes, including reliability, performance, security posture, cost efficiency, and supportability for the systems delivered
Requirements:
- Bachelor's, Master's, or Ph.D. in Computer Science, AI/ML, Engineering, or a related field, or equivalent practical experience
- 12+ years of professional software engineering experience, including significant ownership of production systems, or equivalent experience demonstrating Senior Staff / Principal-level impact
- Proven track record as a Staff, Senior Staff, Principal, or equivalent technical leader influencing architecture and execution across multiple teams
- Deep experience designing, building, and operating high-scale distributed systems, cloud services, infrastructure platforms, or AI/ML platform services
- Hands-on experience with production AI systems, agentic AI applications, autonomous workflows, tool-using agents, multi-step orchestration, or multi-agent systems
- Practical experience with orchestration frameworks such as LangGraph, LangChain, CrewAI, AutoGen, LlamaIndex, or similar ecosystems
- Deep understanding of LLM application patterns, including prompt design, structured outputs, function/tool calling, context management, RAG, memory, tool safety, and evaluation
- Strong programming skills in Python and ability to contribute high-quality production code, reviews, tests, and debugging in complex distributed environments
- Strong expertise with Kubernetes, Docker, cloud-native infrastructure, service-to-service communication, scalability, fault tolerance, observability, and performance analysis
- Experience defining SLIs/SLOs, production readiness criteria, incident response practices, monitoring, tracing, experiments, and reliability programs for AI or distributed systems
- Strong understanding of AI safety, governance, security, and operational risks for autonomous or semi-autonomous systems, including data handling, access control, auditability, and human accountability
- Excellent written and verbal communication, with demonstrated ability to lead technical direction, resolve ambiguity, and influence senior stakeholders
- Experience optimizing large-scale GPU inference or training workloads for latency, throughput, utilization, availability, and cost
- Experience building or operating model serving, inference gateways, agent runtimes, workflow engines, developer platforms, or internal AI productivity platforms
- Experience integrating AI systems with enterprise APIs, databases, cloud services, vector databases, embeddings, retrieval systems, identity systems, and policy enforcement layers
- Experience with LLM fine-tuning, long-context systems, reasoning models, model routing, caching, batching, quantization, or emerging generative AI research
- Experience building evaluation frameworks for agentic systems, including offline evals, online experiments, golden tasks, adversarial testing, regression gates, and observability dashboards
- Experience using AI-assisted software development tools such as Codex, Claude Code, Cursor, Copilot, or similar systems in large-scale engineering environments
- Track record of defining architectural standards, platform capabilities, or engineering practices adopted across multiple teams or organizations
- Experience in enterprise, cloud infrastructure, regulated, security-sensitive, or mission-critical environments