ARKA Group L.P. is an advanced technologies company serving the U.S. military, intelligence community, and commercial space industry. They are seeking a Senior MLOps Platform Engineer to design and operate a unified MLOps platform that supports both on-premise and AWS environments, enabling the development of Agentic AI products.
Responsibilities:
- Design, implement, and operate a unified MLOps platform that supports both on-premise Kubernetes clusters and AWS. The platform should enable rapid onboarding of new Agentic AI services and provide consistent governance across environments
- Develop reusable CI/CD pipelines (GitLab CI) for model packaging, containerization, automated testing, canary releases, and rollbacks
- Build observability, monitoring, and alerting stacks (Prometheus, Grafana, OpenTelemetry, CloudWatch) to track inference latency, throughput, resource utilization, and data drift for real-time and batch workloads
- Create self-service tooling (CLI, SDKs, UI dashboards) that allows data science and product teams to register models, define inference endpoints, and manage versioning without deep DevOps involvement
- Architect and maintain data pipelines that feed training data, model artifacts, and inference logs into a governed data lake (S3, on-prem object store)
- Collaborate with research and product engineers to translate experimental Agentic AI prototypes into production-grade services, ensuring reproducibility, security, and compliance
- Drive performance optimization for inference workloads (GPU/CPU scaling, model quantization, batching strategies)
- Champion best practices in security (IAM, network policies, secret management), cost efficiency, and disaster recovery for the hybrid infrastructure
- Mentor junior engineers and contribute to internal knowledge bases, upskilling, and review processes
Requirements:
- BS in Computer Science or a related engineering field
- 5+ years of experience building and operating production-grade software infrastructure, preferably in a hybrid on-prem / cloud environment
- Deep expertise with Kubernetes (cluster provisioning, Helm, operators, custom resources) and container runtimes (Docker, OCI)
- Hands-on experience with AWS services (EKS, SageMaker, S3, IAM, CloudWatch, Step Functions) and the ability to bridge on-prem resources with AWS via VPN/Direct Connect
- Strong software engineering skills in Python and at least one compiled language (Go, Rust, or Java) for building platform components and SDKs
- Proficiency with CI/CD and GitOps tooling (Argo CD, Flux, GitLab CI, GitHub Actions, or similar)
- Solid understanding of distributed systems (consensus, fault tolerance, load balancing) and experience tuning high-throughput, low-latency inference pipelines
- Experience with data engineering frameworks (Airflow, Prefect, Kafka, Spark, Flink) and building robust, versioned data pipelines
- Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry, ELK) and the ability to define meaningful SLIs/SLOs for AI services
- Track record of collaborating with research or product teams to move prototypes to production, translating experimental code into maintainable services
- Strong problem-solving mindset, excellent written and verbal communication, and a passion for building scalable AI platforms
- Working knowledge of Scrum and Agile software development methodologies