Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments.
Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark.
Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB.
Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems.
Collaborate closely with software engineering teams to debug and resolve complex production problems.
Participate in 24x7 on-call rotation supporting multi-cloud production environments.
Monitor system metrics, application performance, and infrastructure health using observability tools.
Own the incident management lifecycle, including detection, mitigation, root cause analysis (RCA), and post-incident reviews.
Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency.
Perform capacity planning using system usage and performance data.
Drive SRE best practices, operational standards, and continuous improvement initiatives.

Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field.
6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles.
Strong hands-on experience with cloud platforms (AWS or GCP), including services such as EC2/GCE, IAM, and object storage (S3/GCS).
Experience with containerization and orchestration technologies, especially Docker and Kubernetes.
Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab.
Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver.
Strong understanding of Linux systems administration and configuration management tools like Ansible.
Experience managing distributed data stores and stream-processing platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm.
Strong automation and scripting skills in Python, Go, Rust, or shell.
Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Excellent analytical, troubleshooting, and problem-solving skills.
Strong communication and collaboration skills with the ability to work with cross-functional teams.

Health & Wellbeing: We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial, and emotional wellbeing.
Personal & Professional Development: We also invest in your career because the better you are, the better we all are. We have programs tailored to helping you reach your career goals, whether you want to become a knowledge expert in your field or apply your skills to another division.
Unconditional Inclusion: We are unconditionally inclusive in the way we work and celebrate individual uniqueness.

Senior Site Reliability Engineer, SRE

Key skills