Oracle is a leading enterprise software company focused on cloud solutions and artificial intelligence. In this role, you will drive the development of AI software and infrastructure to support large-scale GPU clusters and generative AI models, working in a collaborative and innovative environment.
Responsibilities:
- Design and develop AI software in Java, Python, and other languages
- Participate in the entire software lifecycle: development, testing, CI/CD, and production operations
- Participate in the entire model development cycle: training, fine-tuning, model serving, evaluation/benchmarking, and human preference learning
- Apply engineering principles for defining robust and maintainable architectures and designs
- Build cloud services on top of the modern Infrastructure as a Service (IaaS) building blocks at OCI
- Design and build distributed, scalable, fault-tolerant software systems to facilitate the development of GenAI models
- Identify requirements, scope solutions, estimate work, and schedule deliverables
- Help establish and drive the adoption of outstanding coding standards and patterns, and help enhance our inclusive engineering culture
- Partner with product managers to contribute to publications, blogs, and open-source ML performance submissions
- Balance product feature development with production operational concerns such as ops automation, structured logging, metrics instrumentation, and on-call participation
Requirements:
- BS/MS in Computer Science or equivalent experience
- 6–10+ years building and shipping enterprise distributed or cloud-native systems
- Experience scaling heterogeneous CPU/GPU training infrastructure for large multimodal frontier models
- Strong foundation in system design, distributed systems, and cloud architecture best practices
- Proficiency in Java, Python, or similar object-oriented languages
- Experience building highly available services using service-oriented design patterns and service-to-service communication protocols
- Proven ability to deliver impact in collaborative, fast-paced environments
- Strong verbal and written communication skills, including technical design documentation
- Hands-on experience with containers and orchestration technologies such as Kubernetes and Docker
- Production experience with Cloud and ML technologies
- Experience in the following areas is ideal but not mandatory:
  - Generative AI modeling: customizing LLMs, and building and deploying LLMs at scale for large-scale data generation
  - Algorithms: Transformer models, attention mechanisms, prompt tooling