Help build and maintain cloud infrastructure and applications for our Legal AI platform
Collaborate with engineering teams to establish monitoring, incident response, and deployment strategies
Ensure high availability and reliability of our proprietary models and services
Standardise and implement observability practices through logging, traces, metrics, and monitors
Design, deploy, and operate infrastructure to support product teams as we expand into new regions
Automate manual operational tasks
Participate in and improve on-call and incident handling processes to ensure 24/7 system reliability
Requirements
3+ years of experience in DevOps or Site Reliability Engineering roles
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.) managed with Terraform
Comfortable troubleshooting across the full stack: from the browser, through the networking components and containerised applications, to the data stores
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch, and Datadog)
Excellent problem-solving and communication skills
Experience with AI/ML infrastructure deployments is a plus
Tech Stack
AWS
Cloud
Python
Terraform
Benefits
Equity package: Generous equity scheme, so everyone gets to be an owner of Robin AI!
Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.