Archetype AI is developing an AI platform that brings AI into the physical world, led by a team with backgrounds at Google. The Site Reliability Engineer (SRE) will design, scale, and maintain the infrastructure behind the company's AI-driven products, ensuring high availability and performance while collaborating with engineering and ML teams.
Responsibilities:
- Design, build, and operate highly available distributed systems
- Collaborate with engineering and ML teams to ensure reliable deployment of backend services (written in Rust, C++, or similar languages)
- Implement monitoring, alerting, and observability solutions across infrastructure
- Automate deployments, scaling, and infrastructure provisioning using infrastructure-as-code
- Diagnose and resolve performance bottlenecks, system outages, and production incidents
- Support AI/ML infrastructure for training and serving models at scale, including GPU clusters, pipelines, and inference services
- Contribute to infrastructure architecture, standards, and operational best practices
Requirements:
- 5+ years of experience as an SRE, DevOps Engineer, or Systems Engineer
- Strong expertise in distributed systems, fault-tolerant architectures, and large-scale production environments
- Proficiency in Rust, C++, or another backend language, with a willingness to learn new ones
- Solid experience with Kubernetes, containers, and cloud platforms (AWS, GCP, Azure)
- Hands-on experience with monitoring and observability tools (Prometheus, Grafana, ELK, OpenTelemetry)
- Experience with data pipelines, messaging systems, and streaming technologies (Kafka, Pulsar, etc.)
- Familiarity with AI/ML infrastructure (training pipelines, GPU clusters, inference systems)
- Strong debugging and problem-solving skills, with an automation mindset (Terraform, Ansible, Pulumi, scripting)
- Excellent communication and collaboration skills
Preferred:
- Experience with real-time or low-latency systems
- Open-source contributions to distributed systems or infrastructure projects
- Knowledge of security best practices for distributed environments
- Experience with edge or embedded systems and sensor-based infrastructure
- Background in multimodal data fusion or physical-world perception systems