AppFolio is a technology leader transforming the real estate industry with its AI-native platform. The company is seeking a Staff Machine Learning Engineer to advance its ML platform, with a focus on infrastructure, cost optimization, and collaboration with engineering teams to deliver scalable AI-powered solutions.

Responsibilities:

Design and operate AppFolio's ML infrastructure on AWS — ECS, SageMaker, GPU fleets, model serving, autoscaling, and cost controls
Optimize cost across all AI applications — provider routing, caching, batch vs. real-time, model size selection, and inference economics
Maintain reliable, multi-provider LLM access across Google, OpenAI, and Anthropic with sensible fallbacks and abstractions
Build the training and fine-tuning stack for Small Language Models, including data pipelines, GPU orchestration, and evaluation
Partner with Voice & Agents and Research ML engineers to harden their prototypes into production systems with SLOs, on-call rotations, and observability
Operate AppFolio's AI safety and authorization layer — guardrails on AWS, scoped tool permissions, and human-in-the-loop gates for autonomous agent actions

Requirements:

Systems thinker: You think in terms of platforms and long-term leverage, not just features
Production builder: You've built and scaled ML infrastructure in production with meaningful business impact
Ambiguity: You operate effectively in high ambiguity, turning unclear infra problems into clear direction
Owner-operator: You take ownership with a founder/owner-operator mindset, act with urgency, and focus on outcomes
Pace: You have a strong desire to move fast and deliver impact, while maintaining sound engineering judgment
Collaboration: You are humble, collaborative, and low-ego, and you elevate those around you
Sustainability: You value work-life balance as a foundation for sustained high performance
Reliability mindset: You treat ML infra like any other production system — SLOs, on-call, observability, postmortems
ML infra at scale: You have built and operated production ML infrastructure on AWS, including ECS, SageMaker, GPUs, autoscaling, and cost controls
Inference platforms: You have production experience with model serving for both LLMs and custom models, and you understand quantization, batching, and routing
Provider breadth: Direct experience integrating with Google (Vertex / Gemini), OpenAI, and Anthropic APIs in production
Training capability: You have trained or fine-tuned language models end-to-end and are comfortable with deep learning, evaluation, and inference
Cloud-native engineering: Strong Python, Docker, dependency management, and CI/CD for AI workloads
RAG & agents: Working knowledge of LangChain / LangGraph and modern RAG patterns over structured and unstructured data
Cost optimization: Demonstrated experience reducing unit cost of AI workloads without regressing quality or latency
AI safety & authorization: Hands-on experience operating AI guardrails, scoped tool permissions, and authorization layers for production AI systems
Experience training Small Language Models for production use
GPU performance tuning (vLLM, TensorRT, Triton, or similar)
Prior Staff-level role at a company with a significant AI infra footprint
Experience with ontology-driven systems or knowledge graphs supporting AI applications
Contributions to open-source ML infrastructure or LLM tooling
