Emergent Labs is a company focused on building autonomous coding agents for software development. The Software Engineer - Infrastructure will maintain platform stability, manage Kubernetes workloads, and enhance observability while collaborating with product and backend teams.
Responsibilities:
- Maintain stability of our platform consisting of distributed microservices closely interacting with Kubernetes and cloud providers (GCP, AWS)
- Manage Kubernetes workloads with ArgoCD (GitOps) — deploy, monitor, and troubleshoot application syncs, resource trees, and rollouts
- Debug and resolve complex Kubernetes issues across clusters
- Manage CDN and edge infrastructure (Cloudflare) for performance, caching, and traffic management
- Automate infrastructure lifecycle operations and workflows
- Own the observability stack: Grafana (dashboards, Loki logs, Prometheus metrics) and New Relic (APM, golden metrics, transaction analysis)
- Enhance monitoring, alerting, and distributed tracing across services
- Participate in on-call rotation via PagerDuty, handle incident response, and perform root cause analysis
- Proactively identify reliability risks before they become incidents
- Support the platform that runs AI agent workloads: job scheduling, trajectory tracking, environment provisioning, deployments, and cost attribution
- Develop Kubernetes controllers and operators to extend platform capabilities for agent orchestration
- Work closely with product and backend teams to ensure platform scalability and reliability
- Build internal tools, automate workflows, and integrate systems to improve team productivity
- Stay current with Kubernetes releases, CNCF ecosystem updates, and cloud-native best practices
Requirements:
- 4+ years of software/platform engineering experience with production systems
- Strong proficiency in Go or Python: you write production code in at least one of them daily
- Hands-on experience building and deploying services on Kubernetes: not just writing YAML, but developing something that runs on K8s
- Experience with GitOps tooling (ArgoCD, Flux, or similar)
- Strong networking and DNS fundamentals — TCP/IP, HTTP, load balancing, DNS resolution, TLS, and debugging connectivity issues
- Solid Linux/OS fundamentals (process management, filesystems, memory, systemd) and comfort debugging with tools like strace, tcpdump, and netstat
- Relational databases — experience with PostgreSQL, MySQL, or similar; indexing, query optimization, replication, and backup/restore procedures
- NoSQL databases — familiarity with MongoDB, DynamoDB, Redis, or similar for document/key-value workloads
- Caching — experience with Redis, Memcached, or similar for application and infrastructure-level caching
- Message queues & streaming — hands-on with Kafka, SQS, RabbitMQ, or similar for event-driven architectures
- Strong SQL skills for debugging and operational queries
- Comfortable with the CNCF ecosystem — Helm, Kustomize, cert-manager, Ingress controllers, CNI/CSI interfaces
- Hands-on with at least one observability stack (Grafana/Prometheus/Loki, New Relic, Datadog, or similar)
- Familiarity with GCP and/or AWS — managed Kubernetes (GKE/EKS), networking, IAM, storage, and cloud-native services (SES, SQS, S3, etc.)
- Experience with CDN/edge platforms (Cloudflare, CloudFront, or similar)
- Experience building Kubernetes Operators (kubebuilder, operator-sdk, or controller-runtime)
- Experience tuning Kubernetes core components (API server, kubelet, scheduler)
- Familiarity with AI/LLM infrastructure — token management, cost tracking, agent orchestration
- Experience with CI/CD pipelines (GitHub Actions, automated testing, deployment pipelines)
- Infrastructure as Code experience (Terraform, Pulumi, or similar)
- Previous work on large-scale distributed systems or platform-as-a-service
- Startup experience — you thrive in fast-paced, ambiguous environments