Scale AI is building the infrastructure that makes enterprise AI seamless. They are seeking a Senior or Staff Infrastructure Engineer to act as a primary technical lead, engineering the 'paved road' for knowledge retrieval and inference engines while ensuring the platform remains reliable for enterprise agents.
Responsibilities:
- Architect multi-cloud systems and abstractions that allow the SGP platform to run on top of existing cloud providers
- Use our own data and AI platform to analyze build and test logs and metrics to identify areas for improvement
- Define the architectural patterns for our multi-cloud infrastructure to support secure, reliable, and scalable agentic workflows for enterprise customers
- Improve the efficiency, reliability, accuracy, and response times of our engineering and infrastructure systems, including CI/CD processes, test frameworks, data quality assurance, end-to-end reconciliation, and anomaly detection
- Collaborate with platform and product teams to develop and implement innovative infrastructure that scales to meet evolving needs
- Design and champion highly scalable, reliable, and low-latency infrastructure and frameworks for building, orchestrating, and evaluating multi-agent systems at enterprise scale
- Lead the infrastructure roadmap with a strong focus on compliance, privacy, and security standards, including designing change management and data isolation strategies
- Own the development and maintenance of our best-in-class agentic observability platform (logging, metrics, tracing, and analytics) to proactively ensure system health and enable rapid incident response
- Drive developer efficiency by building automated tooling and championing Infrastructure-as-Code (IaC) practices throughout the engineering organization
Requirements:
- Proven experience in a senior role, with 5+ years of full-time software engineering experience
- Deep understanding of modern infrastructure practices, including CI/CD, IaC (e.g., Terraform, Helm charts), container orchestration (e.g., Kubernetes), and observability platforms (e.g., Datadog, Prometheus, Grafana)
- Extensive experience with at least one major cloud provider (AWS, Azure, or GCP)
- Strong knowledge of security and compliance in enterprise environments, with a focus on access management, data isolation, and customer-specific VPC setups
- Proficiency in Python or JavaScript/TypeScript, and SQL
- Hands-on experience with, and a passion for, agents, LLMs, vector databases, and other emerging AI technologies