Core42 is a leader in AI-powered cloud and digital infrastructure, driving transformative technology solutions globally. As a Senior Site Reliability Engineer, you will design and implement scalable infrastructure for large-scale AI workloads and work with engineering, product, and infrastructure teams to improve system reliability and performance.
Responsibilities:
- CI/CD & Automation: Design, build, and maintain robust CI/CD pipelines using tools such as GitLab CI, Azure DevOps, or Jenkins to enable rapid and secure software delivery
- Kubernetes Operations: Operate, manage, and optimize Kubernetes clusters, ensuring scalability, performance, and resilience
- Infrastructure as Code: Develop and maintain infrastructure using Terraform, Helm, Ansible, or similar tools to automate provisioning and configuration
- Observability & Monitoring: Implement and manage monitoring solutions using Prometheus, VictoriaMetrics, Grafana, and ELK/EFK to ensure system health and performance
- Incident Management: Lead root cause analysis (RCA), post-mortems, and continuous improvement initiatives to enhance system reliability
- Reliability Engineering: Define and implement SRE best practices, including SLAs, SLOs, and error budgets
- Logging & Alerting: Build and maintain logging, alerting, and tracing systems for proactive issue detection and rapid troubleshooting
- Security & Compliance: Enforce security best practices and compliance standards across CI/CD pipelines and runtime environments; support audit readiness
- Collaboration: Work cross-functionally with engineering, product, and infrastructure teams to align platform capabilities with business needs
- Mentorship: Provide guidance and mentorship to junior engineers and contribute to knowledge sharing across teams
- On-call Support: Participate in on-call rotations to support critical platform services
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
- 5+ years of experience in DevOps, Site Reliability Engineering, or platform engineering roles in production environments
- Proven experience managing Kubernetes clusters (e.g., GKE, EKS, AKS, or self-managed)
- Hands-on experience with CI/CD tools and automation frameworks
- Strong experience with infrastructure-as-code tools such as Terraform, Helm, or Ansible
- Proficiency in container technologies (Docker, containerd) and orchestration with Kubernetes
- Strong scripting/programming skills (e.g., Python, Bash, Go)
- Experience with observability and monitoring stacks (Prometheus, Grafana, ELK/EFK)
- Solid understanding of Linux systems, networking concepts, and cloud-native security best practices
- Experience supporting AI/ML or HPC workloads in production environments
- Knowledge of GPU resource management, workload schedulers, and performance tuning
- Familiarity with distributed systems and large-scale infrastructure environments
- Experience with incident management frameworks and reliability engineering practices
- Strong collaboration and communication skills across cross-functional teams