Building, managing, and optimizing the underlying infrastructure and tools for large-scale data processing workloads.
Designing systems for collecting metrics (Prometheus) and visualizing data (Grafana).
Architecting and managing the platforms where Spark runs, such as Kubernetes clusters or managed cloud services like Amazon EKS.
Packaging Spark workloads and integrating them with orchestration systems.
Deploying infrastructure with Terraform and Ansible, and troubleshooting job failures.
Building automation and tooling in languages such as Python, Java, or Scala, along with Linux shell scripting (Bash).
Implementing and maintaining systems for monitoring, logging, and alerting.
Developing and optimizing the data catalog platform (e.g., Apache Iceberg).
Collaborating with Data Stewards, Analysts, and Scientists to address data needs and issues.
Creating and maintaining documentation for Kubernetes infrastructure and providing training to team members.
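The monitoring and automation responsibilities above can be illustrated with a short, self-contained Python sketch: rendering custom Spark job metrics in the Prometheus text exposition format that a Prometheus server could scrape. The metric and label names (`spark_job_failures_total`, `app`, `namespace`) are hypothetical examples, not part of this role's actual stack.

```python
# Illustrative sketch: exposing Spark job health as Prometheus-style metrics.
# Metric and label names here are hypothetical.

def format_metric(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

def render_job_metrics(jobs: list) -> str:
    """Render failure counts for a batch of Spark jobs."""
    lines = ["# TYPE spark_job_failures_total counter"]
    for job in jobs:
        lines.append(format_metric(
            "spark_job_failures_total",
            {"app": job["app"], "namespace": job["namespace"]},
            job["failures"],
        ))
    return "\n".join(lines)

print(render_job_metrics([
    {"app": "daily-etl", "namespace": "analytics", "failures": 2},
]))
```

In practice this kind of snippet would live behind an HTTP `/metrics` endpoint (or use the official `prometheus_client` library) so Prometheus can scrape it and Grafana can chart it.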
Requirements
Bachelor's degree in computer science or a related field, or equivalent experience (typically 7 years in a DevOps or systems engineering role).
Expertise in Apache Spark: Deep understanding of Spark architecture, including RDDs, DataFrames, execution hierarchy, lazy evaluation, shuffling, and fault tolerance.
Proficiency in languages used for Spark development and automation, such as Python, PySpark, and Scala/Java.
Proficiency in Linux shell scripting (Bash).
Proficiency in writing SQL.
Experience with CI/CD tools and GitHub.
Experience setting up and using observability tools such as Prometheus and Grafana.
Strong knowledge of networking (TCP/IP, DNS, load balancing, etc.) and hardware components.
Experience automating infrastructure with Terraform and Ansible.
Hands-on experience with on-prem environments and major cloud providers (AWS, Azure, GCP), plus containerization and orchestration tools like Docker and Kubernetes.
Hands-on experience setting up AWS services such as IAM, VPC, and EC2.
Familiarity with related technologies and formats like Delta Lake, Apache Iceberg, Apache Kafka, Hadoop, and various data storage systems (S3, HDFS, etc.).
Hands-on experience with Databricks, Snowflake, Apache Iceberg, Unity Catalog, or similar tools.
Solid understanding of data lakes and governance.
Experience setting up and maintaining caching layers like Alluxio.
Strong analytical skills for debugging complex distributed systems issues.
Strong communication and collaboration abilities.
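The Spark lazy-evaluation requirement above can be illustrated with a plain-Python analogy (no cluster needed): just as Spark transformations only build a DAG until an action runs, a generator pipeline does no work until something consumes it. This is a conceptual sketch, not Spark code.

```python
# Plain-Python analogy for Spark's lazy evaluation: building a generator
# pipeline (like map/filter transformations) executes nothing until a
# terminal "action" (here, sum) pulls results through.

evaluated = []

def traced(x):
    evaluated.append(x)  # record when an element is actually processed
    return x * 2

data = range(5)
pipeline = (traced(x) for x in data if x % 2 == 0)  # "transformation": lazy

assert evaluated == []          # nothing has run yet, like a Spark DAG
result = sum(pipeline)          # "action": triggers evaluation
assert evaluated == [0, 2, 4]   # only now were the elements processed
print(result)  # 12
```

In real Spark, the same distinction shows up as transformations (`filter`, `map`, `select`) versus actions (`count`, `collect`, `write`), and it is central to reasoning about shuffles and job performance.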
Tech Stack
Ansible
Apache
AWS
Azure
Cloud
Distributed Systems
DNS
Docker
EC2
Google Cloud Platform
Grafana
Hadoop
HDFS
Java
Kafka
Kubernetes
Linux
Prometheus
PySpark
Python
Scala
Spark
SQL
TCP/IP
Terraform
Unity Catalog
Benefits
Best-in-class benefits for eligible employees
Expert guidance and always-on tools
Physical, financial, and emotional support during big milestones and in everyday life