Design, develop, and deploy scalable, high-performance, production-grade backend services and distributed systems to support large-scale model inference.
Make significant contributions to the technical roadmap and architecture of our inference platform, with a focus on low-latency, high-throughput services.
Ensure the reliability, scalability, and efficiency of our production systems using monitoring and observability tools such as Prometheus and Grafana.
Collaborate with data science, product, and engineering teams to align platform capabilities with the company's strategic goals.
Manage and optimize our cloud infrastructure (GCP) and orchestrate workloads using Kubernetes.
Advocate for and implement DevOps and SRE best practices across the development, testing, deployment, and monitoring of backend services.
Requirements
A degree or equivalent experience in software development or a related technical field.
Strong hands-on experience designing, deploying, and maintaining large-scale distributed systems, including managing cloud infrastructure on GCP and orchestrating workloads with Kubernetes.
Expertise in Go (Golang) for building high-performance, low-latency systems and infrastructure.
Extensive experience with monitoring and observability tools (e.g., Prometheus, Grafana).
Familiarity with microservices architectures, containerization (Docker), and CI/CD best practices.
Excellent communication and collaboration skills, with the ability to explain technical concepts to both technical and non-technical stakeholders.
Tech Stack
Cloud
Docker
Go
Google Cloud Platform
Grafana
Kubernetes
Microservices
Prometheus
Benefits
Relocation support is not available for this position.