MAASco Tech LLC is seeking an experienced MLOps Platform Engineer to design, build, and support enterprise-grade machine learning operations capabilities. This hands-on engineering role involves managing MLOps platform components, overseeing model deployment, and collaborating with data teams to operationalize machine learning solutions.
Responsibilities:
- Engineer, manage, and support MLOps platform components across AWS and EKS-based environments
- Oversee deployment, configuration, and operation of infrastructure used for ML training, batch inference, and real-time model serving
- Ensure platform availability, resilience, and performance across dev, test, and production environments
- Implement role-based access controls (RBAC), network policies, and scalable namespace designs within EKS
- Build and support CI/CD pipelines (GitLab) for model packaging, container image builds, vulnerability scanning, and automated deployment flows
- Enable standardized model release processes including environment promotion, versioning, and rollback workflows
- Integrate CI/CD with ML frameworks, model repositories, artifacts, and runtime environments
- Design and manage EKS workloads supporting containerized ML jobs and microservices
- Implement auto-scaling, resource quotas, cluster optimization, and multi-tenant workload isolation
- Support GPU- and CPU-based training and inference workloads
- Implement logging, monitoring, and alerting for ML pipelines, model endpoints, batch jobs, and platform components
- Analyze compute, storage, and data transfer usage to optimize cost efficiency across ML workloads
- Perform incident response, root cause analysis, and long-term remediation planning
- Partner with Data Scientists, ML Engineers, and application teams to operationalize end-to-end machine learning solutions
- Provide technical guidance on best practices for ML model lifecycle management, deployment patterns, and scalable architectures
- Contribute to documentation, runbooks, onboarding materials, and internal knowledge bases
Requirements:
- 3+ years of hands-on experience with AWS services, including EKS, EC2, S3, IAM, CloudWatch, and ECR
- Strong experience operating and troubleshooting Kubernetes (preferably AWS EKS)
- Proficiency in containerization (Docker) and orchestration concepts
- Strong programming/scripting experience in Python and Bash
- Experience building and managing CI/CD pipelines (GitLab or equivalent)
- Familiarity with machine learning workflows, including training, inference, and model monitoring
- Experience with infrastructure-as-code (Terraform or CloudFormation)
- Experience supporting production platforms, including incident management and root cause analysis
- Experience managing data analytics platforms and tools (e.g., Domino, SageMaker)
- Experience with ML lifecycle tools such as MLflow or similar
- Experience supporting GPU-based workloads or distributed training environments
- Familiarity with enterprise MLOps architectures and patterns (batch, real-time, microservices)
- Understanding of data processing frameworks and feature pipelines