AppFolio is a technology leader transforming the real estate industry with its AI-native platform. They are seeking a Staff Machine Learning Engineer to advance their ML platform, focusing on infrastructure, cost optimization, and collaboration with engineering teams to produce scalable AI-powered solutions.
Responsibilities:
- Design and operate AppFolio's ML infrastructure on AWS — ECS, SageMaker, GPU fleets, model serving, autoscaling, and cost controls
- Optimize cost across all AI applications — provider routing, caching, batch vs. real-time, model size selection, and inference economics
- Maintain reliable, multi-provider LLM access across Google, OpenAI, and Anthropic with sensible fallbacks and abstractions
- Build the training and fine-tuning stack for Small Language Models, including data pipelines, GPU orchestration, and evaluation
- Partner with Voice & Agents and Research ML engineers to harden their prototypes into production systems with SLOs, on-call rotations, and observability
- Operate AppFolio's AI safety and authorization layer — guardrails on AWS, scoped tool permissions, and human-in-the-loop gates for autonomous agent actions
Requirements:
- Systems thinker: You think in terms of platforms and long-term leverage, not just features
- Production builder: You've built and scaled ML infrastructure in production with meaningful business impact
- Ambiguity: You operate effectively in high ambiguity, turning unclear infra problems into clear direction
- Owner-operator: You take ownership with a founder/owner-operator mindset, act with urgency, and focus on outcomes
- Pace: You have a strong desire to move fast and deliver impact, while maintaining sound engineering judgment
- Collaboration: You are humble, collaborative, and low-ego, and you elevate those around you
- Sustainability: You value work-life balance as a foundation for sustained high performance
- Reliability mindset: You treat ML infra like any other production system — SLOs, on-call, observability, postmortems
- ML infra at scale: You have built and operated production ML infrastructure on AWS, including ECS, SageMaker, GPUs, autoscaling, and cost controls
- Inference platforms: You have production experience with model serving for both LLMs and custom models, and you understand quantization, batching, and routing
- Provider breadth: Direct experience integrating with Google (Vertex / Gemini), OpenAI, and Anthropic APIs in production
- Training capability: You have trained or fine-tuned language models end-to-end and are comfortable with deep learning, evaluation, and inference
- Cloud-native engineering: Strong Python, Docker, dependency management, and CI/CD for AI workloads
- RAG & agents: Working knowledge of LangChain / LangGraph and modern RAG patterns over structured and unstructured data
- Cost optimization: Demonstrated experience reducing unit cost of AI workloads without regressing quality or latency
- AI safety & authorization: Hands-on experience operating AI guardrails, scoped tool permissions, and authorization layers for production AI systems
Preferred qualifications:
- Experience training Small Language Models for production use
- GPU performance tuning (vLLM, TensorRT, Triton, or similar)
- Prior Staff-level role at a company with a significant AI infra footprint
- Experience with ontology-driven systems or knowledge graphs supporting AI applications
- Contributions to open-source ML infrastructure or LLM tooling