Machine Learning Operations Engineer at Nuvei | JobVerse
Machine Learning Operations Engineer
Nuvei
Israel
Full Time
2 hours ago
No Sponsorship
Key skills
AWS
Azure
Cloud
Docker
Google Cloud Platform
Grafana
Kubernetes
Microservices
Prometheus
Python
Ray
Spark
Terraform
Unity
Bash
ML
LLM
RAG
MLOps
MLflow
Databricks
dbt
FastAPI
GCP
Google Cloud
OpenTelemetry
Service Mesh
Caching
CI/CD
A/B Testing
About this role
Role Overview
Operate and develop ML/LLM platforms on Kubernetes and cloud (Azure; AWS/GCP acceptable) using Docker, Terraform, and other relevant tools
Manage object storage, GPUs, and autoscaling for training & low-latency model serving
Manage cloud environment, networking, service mesh, secrets, and policies to meet PCI-DSS and data-residency requirements
Build end-to-end CI/CD for models/agents/MCP tooling (versioning, tests, approvals)
Deliver real-time fraud/risk scoring and agent signals under strict latency SLOs
Maintain MCP servers/clients: tool/resource definitions, versioning, quotas, isolation, access controls
Integrate agents with microservices, event streams, and rule engines; provide SLAs, tracing, and on-call runbooks
Measure operational metrics for ML/LLM workloads (latency, throughput, cost, token usage, tool success, safety events)
Enforce governance: RBAC/ABAC, row-level security, encryption, PII/secrets management, audit trails
Partner with DS on packaging (wheels/conda/containers), feature contracts, and reproducible experiments
Lead incident response and post-mortems
Drive FinOps: right-sizing, GPU utilization, batching/caching, budget alerts
Requirements
4+ years in DevOps/MLOps/Platform roles building and operating production ML systems (batch and real-time)
Strong hands-on experience with Kubernetes, Docker, Terraform/IaC, and CI/CD
Practical experience with Spark/Databricks and scalable data processing
Proficiency in Python & Bash
Ability to operate DS code and optimize runtime performance
Experience with model registries (MLflow or similar), experiment tracking, and artifact management
Production model serving using FastAPI/Ray Serve/Triton/TorchServe, including autoscaling and rollout strategies
Monitoring and tracing with Prometheus/Grafana/OpenTelemetry; alerting tied to SLOs/SLAs
Solid understanding of PCI-DSS/GDPR considerations for data and ML systems
Experience with the Azure cloud environment is a big plus
Operating LLM/agent workloads in production (prompt/config versioning, tool execution reliability, fallback/retry policies)
Building/maintaining RAG stacks (indexing pipelines, vector DBs, retrieval evaluation, hybrid search)
Implementing guardrails (policy checks, content filters, allow/deny lists) and human-in-the-loop workflows
Experience with feature stores (Qwak Feature Store, Feast)
A/B testing for models and agents, offline/online evaluation frameworks
Payments/fraud/risk domain experience; integrating ML outputs with rule engines and operational systems
Advantage
Familiarity with Databricks Unity Catalog, dbt, or similar tooling
Benefits
Private Medical Insurance
Office and home hybrid working
Global bonus plan
Volunteering programs
Prime location office close to Tel Aviv train station