ARKA Group L.P. is an advanced technologies company serving the U.S. military, intelligence community, and commercial space industry. They are seeking a Senior MLOps Platform Engineer to design and operate a unified MLOps platform that supports both on-premise and AWS environments, enabling the development of Agentic AI products.
Responsibilities:
- Design, implement, and operate a unified MLOps platform that supports both on-premise Kubernetes clusters and AWS. The platform should enable rapid onboarding of new Agentic AI services and provide consistent governance across environments
- Develop reusable CI/CD pipelines (GitLab CI) for model packaging, containerization, automated testing, canary releases, and rollbacks
- Build observability, monitoring, and alerting stacks (Prometheus, Grafana, OpenTelemetry, CloudWatch) to track inference latency, throughput, resource utilization, and data drift for real-time and batch workloads
- Create self-service tooling (CLI, SDKs, UI dashboards) that allows data science and product teams to register models, define inference endpoints, and manage versioning without deep DevOps involvement
- Architect and maintain data pipelines that feed training data, model artifacts, and inference logs into a governed data lake (S3, on-prem object store)
- Collaborate with research and product engineers to translate experimental Agentic AI prototypes into production-grade services, ensuring reproducibility, security, and compliance
- Drive performance optimization for inference workloads (GPU/CPU scaling, model quantization, batching strategies)
- Champion best practices in security (IAM, network policies, secret management), cost efficiency, and disaster recovery for the hybrid infrastructure
- Mentor junior engineers and contribute to internal knowledge bases, upskilling, and review processes
Requirements:
- BS in Computer Science or a related engineering field
- 5+ years of experience building and operating production-grade software infrastructure, preferably in a hybrid on-prem / cloud environment
- Deep expertise with Kubernetes (cluster provisioning, Helm, operators, custom resources) and container runtimes (Docker, OCI)
- Hands-on experience with AWS services (EKS, SageMaker, S3, IAM, CloudWatch, Step Functions) and the ability to bridge on-prem resources with AWS via VPN/Direct Connect
- Strong software engineering skills in Python and at least one compiled language (Go, Rust, or Java) for building platform components and SDKs
- Proficiency with CI/CD and GitOps tooling (Argo CD, Flux, GitLab CI, GitHub Actions, or similar)
- Solid understanding of distributed systems (consensus, fault tolerance, load balancing) and experience tuning high-throughput, low-latency inference pipelines
- Experience with data engineering frameworks (Airflow, Prefect, Kafka, Spark, Flink) and building robust, versioned data pipelines
- Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry, ELK) and the ability to define meaningful SLIs/SLOs for AI services
- Track record of collaborating with research or product teams to move prototypes to production, translating experimental code into maintainable services
- Strong problem-solving mindset, excellent written and verbal communication, and a passion for building scalable AI platforms
- Working knowledge of Scrum and Agile software development methodologies