Scale Addi’s competitive advantage by building a world-class ML Ops foundation that accelerates the transition from model prototype to production, while ensuring our AI systems, from credit scoring to generative agents, are resilient, cost-efficient, and seamlessly integrated into our core financial product.
Ensure ML/AI systems are served reliably in production, maintaining strong operational excellence for availability, latency, and incident response, partnering with Data Scientists on model/agent logic and iteration.
Build and maintain the serving and integration layer for ML/AI solutions (APIs, connectors, asynchronous execution patterns), enabling seamless integration with internal systems and Ops tooling.
Establish clear mechanisms for monitoring and reliability of ML/AI systems in production (dashboards, alerts, core KPIs, regression detection, and data/feature quality checks).
Enable repeatable delivery for ML/AI services through strong engineering practices (CI/CD, testing, release strategies, rollback, and operational runbooks).
Contribute to our Architecture Decision Records repository by evaluating and proposing platform upgrades for ML/AI systems (e.g., feature serving patterns, workflow orchestration, scalable storage) to improve reliability, scalability, and reuse.
Requirements
Proven experience in architecting and serving production-grade ML systems
4–7 years of experience in software engineering, with at least 3 years focused specifically on ML Ops or Data Engineering in a production environment
Demonstrated ability to design high-availability serving layers using APIs (FastAPI, gRPC) and asynchronous execution patterns to handle high-concurrency fintech workloads
Deep understanding of the "handshake" between data science and engineering, ensuring models are packaged, versioned, and integrated into internal systems without friction
Expert-level knowledge of AWS (or similar), Kubernetes, Airflow/Prefect, and Databricks/Spark
Track record of implementing request batching and model quantization to balance high-performance throughput with infrastructure costs
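For illustration, the asynchronous execution and request-batching patterns named above can be sketched as a minimal asyncio micro-batcher that trades a few milliseconds of latency for one batched model call; all names here are hypothetical, not a specific Addi component:

```python
import asyncio

# Minimal micro-batching sketch: individual predict() calls are queued and
# served together in a single model invocation. Illustrative only.
class MicroBatcher:
    def __init__(self, model_fn, max_batch=32, max_wait_ms=10):
        self.model_fn = model_fn          # batch-capable scoring function
        self.max_batch = max_batch        # flush once this many requests queue up
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut                  # resolved when the batch is scored

    async def run(self):
        while True:
            batch = [await self.queue.get()]              # block for first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:            # then collect until deadline
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [features for features, _ in batch]
            for (_, fut), score in zip(batch, self.model_fn(inputs)):
                fut.set_result(score)

async def main():
    # Stand-in "model": scores a batch of feature vectors in one call
    batcher = MicroBatcher(model_fn=lambda xs: [sum(x) for x in xs])
    worker = asyncio.create_task(batcher.run())
    scores = await asyncio.gather(*(batcher.predict([i, i + 1]) for i in range(5)))
    worker.cancel()
    return scores

print(asyncio.run(main()))  # five requests, one model invocation
```

The same shape drops into a FastAPI endpoint by awaiting `batcher.predict(...)` inside the route handler while `run()` lives as a background task.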
Strong technical fluency in the Python and data ecosystem
Advanced Python engineering skills, moving beyond simple scripting to build modular, testable, and maintainable codebases
Expert-level knowledge of core ML libraries (NumPy, Pandas, scikit-learn) and at least one deep learning framework (PyTorch or TensorFlow)
Solid expertise in data-intensive stacks like Spark or Databricks and the ability to write complex, optimized SQL for feature extraction and data validation
Experienced in establishing mission-critical observability and reliability
Demonstrated ability to build comprehensive monitoring suites (logs, metrics, traces) that detect not just system downtime but also ML-specific failures like data drift or feature quality regressions
Track record of leading incident response and post-mortems, with a focus on reducing Mean Time to Recovery (MTTR) for model-related production issues
Proven ability to implement automated alerting and regression detection that prevents degraded models from impacting the end-customer experience
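As a concrete example of the drift and regression detection described in this group, a Population Stability Index (PSI) check compares a live feature sample against its training baseline and fires when the shift crosses a conventional threshold; the thresholds and names below are common rules of thumb, not a prescribed production setup:

```python
import numpy as np

# Sketch of a data-drift check via the Population Stability Index (PSI):
# bin the baseline distribution, then measure how far live traffic deviates.
def population_stability_index(baseline, live, bins=10):
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range live values
    expected, _ = np.histogram(baseline, bins=edges)
    actual, _ = np.histogram(live, bins=edges)
    expected = np.clip(expected / expected.sum(), 1e-6, None)  # avoid log(0)
    actual = np.clip(actual / actual.sum(), 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)        # training-time feature distribution
stable = rng.normal(0, 1, 10_000)          # live traffic, unchanged
shifted = rng.normal(1.0, 1, 10_000)       # simulated upstream feature shift

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate, > 0.25 alert
print(population_stability_index(baseline, stable))
print(population_stability_index(baseline, shifted))
```

In practice the alert condition (`psi > 0.25`) would feed the same alerting path as latency or error-rate SLO breaches, so a degraded model pages on-call before customers notice.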
Demonstrated mastery of ML orchestration and engineering best practices
Proven experience in building repeatable CI/CD pipelines for ML (MLOps), including automated testing, canary releases, and seamless rollback strategies
Solid expertise in workflow orchestration tools (e.g., Airflow, Prefect) and storage patterns (Postgres, vector DBs) required for complex ML lifecycles
Experienced in contributing to Architecture Decision Records (ADRs) to standardize feature serving patterns and scalable storage across the engineering org
Track record of building and scaling AI Agentic systems
Practical experience with the components of modern AI agents, including RAG (Retrieval-Augmented Generation), orchestration frameworks (LangChain/LlamaIndex), and guardrail implementation
Understanding of the unique operational challenges of LLMs, such as token cost management, prompt versioning, and latency optimization
Experienced in evaluating and integrating graph-based architectures or graph databases when required for complex data relationship mapping
Exceptional cross-functional communication and ownership
Proven ability to translate highly technical infrastructure bottlenecks into clear business risks or opportunities for non-technical stakeholders
Demonstrates an "Ownership Mentality" by taking end-to-end responsibility for the reliability of the ML platform, from the initial architectural proposal to 2:00 AM incident resolution
Varies communication style effectively to mentor Data Scientists on engineering best practices while collaborating with Product Managers on roadmap feasibility.