Tags: AWS, Cloud, Docker, Grafana, Kubernetes, Prometheus, PySpark, Python, Spark, SQL, Terraform, AI/ML, Data Engineering, dbt, GitHub Actions, S3, IAM, Glue, GitHub, Remote Work
Role Overview
Ingest the world: Design and maintain ingestion frameworks for high-volume structured and unstructured data from operational systems, APIs, file drops, and events.
Support streaming and batch use cases across latency windows.
Transform at scale: Develop transformation logic using SQL, Python, Spark, and modern declarative tools like dbt or sqlmesh.
You’ll handle deduplication, windowing, watermarking, late-arriving data, and more.
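For a flavor of that work, here is a minimal, hedged sketch of watermarking, deduplication, and windowing in PySpark Structured Streaming; the Kafka broker, topic, and event schema are illustrative placeholders, not details of this role's actual stack:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("dedupe_window_sketch").getOrCreate()

# Hypothetical event schema; a real pipeline would load this from a registry.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
)

parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

deduped = (
    parsed
    # Accept events up to 15 minutes late; older state can be discarded.
    .withWatermark("event_time", "15 minutes")
    # Drop replayed events; including event_time keeps dedup state bounded.
    .dropDuplicates(["event_id", "event_time"])
)

# Tumbling 5-minute windows over the deduplicated stream.
counts = deduped.groupBy(F.window("event_time", "5 minutes"), "event_type").count()

query = counts.writeStream.outputMode("append").format("console").start()
```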
Curate for trust: Collaborate with domain teams to annotate datasets with metadata, ownership, PII classification, and usage lineage.
Enforce naming standards, partitioning schemes, and schema evolution policies.
Optimize for the lakehouse: Work within a modern lakehouse architecture, leveraging Delta Lake, S3, Glue, and EMR, to ensure scalable performance and queryability across real-time and historical views.
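For illustration, a hedged sketch of what a partitioned, schema-evolving write looks like with the Delta Lake Spark API; the S3 path and table layout are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse_sketch")
    # Assumes the delta-spark package is available on the cluster.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("2024-01-01", "orders", 42)],
    ["ingest_date", "source", "row_count"],
)

(
    df.write.format("delta")
    .mode("append")
    .partitionBy("ingest_date")       # partitioning scheme enforced at write time
    .option("mergeSchema", "true")    # allow additive schema evolution
    .save("s3://example-lake/bronze/ops_events")  # hypothetical S3 path
)
```

Note that mergeSchema permits additive column changes only; breaking changes would still go through the schema evolution policies mentioned above.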
Build for observability: Instrument your pipelines with quality checks, cost visibility, and lineage hooks.
Integrate with OpenMetadata, Prometheus, or OpenLineage to ensure platform reliability and traceability.
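As one concrete possibility, quality signals could be exposed to Prometheus with the prometheus_client library; the metric names and scrape port below are assumptions, not an established convention here:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metric names; align with the platform's naming standards.
ROWS_PROCESSED = Counter(
    "pipeline_rows_processed_total", "Rows processed per dataset", ["dataset"]
)
NULL_KEY_RATE = Gauge(
    "pipeline_null_key_rate", "Fraction of null join keys in the latest batch", ["dataset"]
)

def record_batch_metrics(dataset: str, rows: int, null_keys: int) -> None:
    """Publish simple data-quality signals after each batch."""
    ROWS_PROCESSED.labels(dataset=dataset).inc(rows)
    NULL_KEY_RATE.labels(dataset=dataset).set(null_keys / rows if rows else 0.0)

if __name__ == "__main__":
    start_http_server(9108)  # expose /metrics for a Prometheus scrape
    record_batch_metrics("ops_events", rows=10_000, null_keys=12)
```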
Enable production-readiness: Support deployment workflows via GitHub Actions, Terraform, and IaC patterns.
Your code will be versioned, testable, and safe for multi-tenant deployments.
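A small sketch of that IaC pattern in Python with AWS CDK (Terraform would be equally at home; the stack and bucket below are purely illustrative):

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class LakeIngestStack(cdk.Stack):
    """Hypothetical stack: a versioned, encrypted landing bucket for file drops."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "LandingBucket",
            versioned=True,                             # recoverable object history
            encryption=s3.BucketEncryption.S3_MANAGED,  # encrypted at rest by default
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = cdk.App()
LakeIngestStack(app, "lake-ingest-dev")  # one stack per environment, deployed from CI
app.synth()
```

Because the stack is plain code, it can be versioned, unit-tested, and promoted through environments from a GitHub Actions workflow.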
Think platform-first: Everything you build should be reusable. You’ll help codify data engineering standards, create scaffolding for onboarding new datasets, and drive automation over repetition.
Must Have
AWS — Solid understanding of AWS services beyond just data engineering (storage, compute, networking, IAM, etc.). Preference for candidates already working within the AWS ecosystem.
Data Fundamentals & Data Pipeline Optimization — Working knowledge of optimizing pipelines for cost efficiency and resource utilization.
Platform Engineering Mindset — Must have a genuine interest in platform/infrastructure work, not just pipeline development. Cultural fit on this point is important; we don't want drop-offs post-interview.
Good to Have
Containerization & Orchestration — Conceptual understanding or hands-on experience with Docker and Kubernetes.
Cloud Migration / Multi-cloud — Experience with cloud migrations or working across multi-cloud environments.
AI/ML — Any exposure to AI/ML concepts or tooling is a bonus, not a requirement.
Infrastructure as Code (IaC) — Familiarity with IaC tooling (Terraform, CDK, etc.).
Observability — Familiarity with tools like Grafana and Prometheus for monitoring and alerting.