Design, build, and maintain scalable machine learning pipelines for model deployment, validation, monitoring, and lifecycle management
Implement model versioning, drift detection, and continuous retraining workflows to ensure model accuracy and compliance
Collaborate with data scientists, platform engineers, and security teams to ensure reliable, secure, and efficient delivery of AI/ML capabilities
Develop and maintain systems engineering and cybersecurity artifacts for the System
Prepare, maintain, and execute a System Engineering Plan (SEP) for managing all systems architecture and systems engineering aspects of the program
Conduct the systems engineering activities required to specify, build, and maintain systems engineering designs for the System
Design, engineer, integrate, and continuously improve the underlying infrastructure of the System, including the cloud environment, network, data storage, logging, and auditing functions
Define, document, maintain, and promulgate APIs and technical standards for use and interoperability within and outside the System
Establish and maintain integrations with external model providers, making their available models accessible via API
Provide Tier-4 support for any critical issues with the available services and products, in accordance with defined SLAs
Design, architect, engineer, and continuously improve all cybersecurity elements of the System
Perform site reliability engineering to build and maintain a reliable, scalable, and efficient System by applying software engineering principles to operational tasks
Participate in the Engineering Control Board (ECB) process to support all major engineering milestones and decisions for the program
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related field
Minimum of 8 years of experience in machine learning operations (MLOps) or a related field
Experience with cloud platforms, such as AWS, Azure, or Google Cloud, that host and manage infrastructure
Proficiency in programming languages such as Python, Java, or C++
Experience with containerization and orchestration tools like Docker and Kubernetes
Strong understanding of machine learning model lifecycle management
Experience with CI/CD pipelines and version control systems like Git
Top Secret clearance required to start
Strong problem-solving skills and ability to work in a collaborative environment