Mozilla Corporation is a non-profit-backed technology company that has shaped the internet over the last 25 years, with work spanning diverse areas including AI and security. It is seeking a Senior Machine Learning Engineer to design, build, and operate its AI platform, ensuring efficient and secure deployment of machine learning models across its products.
Responsibilities:
- Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
- Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
- Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
- Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
- Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
- Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
- Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
- Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
- Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews
Requirements:
- Bachelor's degree with 4–6 years of relevant industry experience, or Master's degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
- Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
- Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
- Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
- Hands-on experience working with GPU-based workloads and accelerated computing in production settings
- Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
- Ability to independently scope and drive technical initiatives while balancing product and operational priorities
- Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
- Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams
- Experience implementing inference optimization strategies such as batching, quantization, compilation, model conversion, or hardware-specific tuning
- Familiarity with containerization and orchestration systems (e.g., Docker, Kubernetes) in production environments
- Experience designing observability systems for distributed services, including metrics strategy and performance profiling
- Exposure to privacy-preserving ML techniques, security best practices, or responsible AI system design
- Contributions to open-source ML infrastructure projects or leadership in building reusable internal ML tooling