Core42 is a leader in AI-powered cloud and digital infrastructure, driving transformative technology solutions globally. As a Senior Site Reliability Engineer, you will design and implement scalable infrastructure for large-scale AI workloads and work with engineering, product, and infrastructure teams to improve system reliability and performance.

Responsibilities:

CI/CD & Automation: Design, build, and maintain robust CI/CD pipelines using tools such as GitLab CI, Azure DevOps, and/or Jenkins to enable rapid and secure software delivery
Kubernetes Operations: Operate, manage, and optimize Kubernetes clusters, ensuring scalability, performance, and resilience
Infrastructure as Code: Develop and maintain infrastructure using Terraform, Helm, Ansible, or similar tools to automate provisioning and configuration
Observability & Monitoring: Implement and manage monitoring solutions using Prometheus, VictoriaMetrics, Grafana, and ELK/EFK to ensure system health and performance
Incident Management: Lead root cause analysis (RCA), post-mortems, and continuous improvement initiatives to enhance system reliability
Reliability Engineering: Define and implement SRE best practices, including SLAs, SLOs, and error budgets
Logging & Alerting: Build and maintain logging, alerting, and tracing systems for proactive issue detection and rapid troubleshooting
Security & Compliance: Enforce security best practices and compliance standards across CI/CD pipelines and runtime environments; support audit readiness
Collaboration: Work cross-functionally with engineering, product, and infrastructure teams to align platform capabilities with business needs
Mentorship: Provide guidance and mentorship to junior engineers and contribute to knowledge sharing across teams
On-call Support: Participate in on-call rotations to support critical platform services

Requirements:

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
5+ years of experience in DevOps, Site Reliability Engineering, or platform engineering roles in production environments
Proven experience managing Kubernetes clusters (e.g., GKE, EKS, AKS, or self-managed)
Hands-on experience with CI/CD tools and automation frameworks
Strong experience with infrastructure-as-code tools such as Terraform, Helm, or Ansible
Proficiency in container technologies (Docker, containerd) and orchestration with Kubernetes
Strong scripting/programming skills (e.g., Python, Bash, Go)
Experience with observability and monitoring stacks (Prometheus, Grafana, ELK/EFK)
Solid understanding of Linux systems, networking concepts, and cloud-native security best practices
Experience supporting AI/ML or HPC workloads in production environments
Knowledge of GPU resource management, workload schedulers, and performance tuning
Familiarity with distributed systems and large-scale infrastructure environments
Experience with incident management frameworks and reliability engineering practices
Strong collaboration and communication skills across cross-functional teams

Senior Engineer - Site Reliability
