Core42 is a leader in AI-powered cloud and digital infrastructure, driving transformative technology solutions globally. This role focuses on building an enterprise-grade GPU compute platform: designing, operating, and managing the core services that control GPU resource allocation and consumption at scale.
Responsibilities:
- Design, build, and operate core GPU-as-a-Service (GPUaaS) control plane services
- Develop backend APIs and microservices (Python, Go, or Node.js)
- Integrate deeply with Kubernetes APIs for provisioning, scheduling, and multitenancy
- Build and maintain authentication, authorization, and identity systems (OAuth2, SSO, RBAC, LDAP)
- Design and implement usage tracking and billing systems with strong correctness guarantees
- Design PostgreSQL schemas optimized for scale, auditing, and reliability
- Build CI/CD pipelines and deployment automation for platform services
- Collaborate with infrastructure teams to surface GPU and system telemetry
- Own systems in production, including their reliability, failure modes, and performance
Requirements:
- 4–7 years of software engineering experience in backend, platform, or infrastructure roles
- Strong backend engineering experience in Python (FastAPI), Go, or Node.js
- Hands-on experience with Kubernetes in production environments
- Experience building and operating REST and/or gRPC APIs
- Strong understanding of microservices architecture and cloud-native systems
- Experience with PostgreSQL schema design, performance, and migrations
- Familiarity with authentication/authorization systems (OAuth2, SAML, JWT, RBAC)
- Experience working on systems that require high reliability and correctness under failure conditions
- Ability to operate independently in ambiguous or greenfield environments
- Experience with GPU infrastructure, HPC environments, or AI/ML platforms
- Experience with Kubernetes controllers, operators, Helm, or cluster lifecycle tooling
- Exposure to Slurm or hybrid Kubernetes/HPC scheduling systems
- Experience with observability stacks (Prometheus, Grafana, OpenTelemetry)
- Experience building developer platforms or internal infrastructure tools
- Familiarity with MLOps tooling (Kubeflow, MLflow, PyTorch pipelines)
- Experience with GitOps workflows (Argo CD, Flux, or similar)