Mozilla Corporation is a non-profit-backed technology company that has shaped the internet over the last 25 years, focusing on diverse areas including AI and security. It is seeking a Senior Machine Learning Engineer to design, build, and operate its AI platform, ensuring efficient and secure deployment of machine learning models across its products.

Responsibilities:

Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews

Requirements:

Bachelor's degree with 4–6 years of relevant industry experience, or Master's degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
Hands-on experience working with GPU-based workloads and accelerated computing in production settings
Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
Ability to independently scope and drive technical initiatives while balancing product and operational priorities
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams
Experience implementing inference optimization strategies such as batching, quantization, compilation, model conversion, or hardware-specific tuning
Familiarity with containerization and orchestration systems (e.g., Docker, Kubernetes) in production environments
Experience designing observability systems for distributed services, including metrics strategy and performance profiling
Exposure to privacy-preserving ML techniques, security best practices, or responsible AI system design
Contributions to open-source ML infrastructure projects or leadership in building reusable internal ML tooling

Senior Machine Learning Engineer, AI Platform
