Veeva Systems is a mission-driven organization and pioneer in industry cloud, helping life sciences companies bring therapies to patients faster. As a Data Engineer, you will own the end-to-end development lifecycle, collaborating with a high-performing engineering team to design, build, and deploy high-impact features for Veeva's life sciences customers.
Responsibilities:
- Architect and build resilient, distributed data processing systems using Python and Spark on AWS
- Design and implement end-to-end ETL/ELT workflows that ingest and unify data from diverse sources, ranging from modern table formats like Iceberg and Delta to legacy business files such as Excel and CSV, ensuring a scalable, consistent single source of truth for the organization
- Lead the implementation of the Medallion Architecture, managing data maturity through Bronze, Silver, and Gold layers
- Define how data is structured, classified, and stored to maximize business value while ensuring scalability and high availability
- Build reusable libraries and frameworks for data quality validation, metadata tracking, and pipeline monitoring
- Build CI/CD processes that automate deployment and testing, maintaining a high bar for engineering excellence
- Enforce data governance standards, including security, privacy, and regulatory compliance
- Proactively monitor system health, implement automated observability, and resolve complex bottlenecks in distributed systems to ensure peak resource efficiency and cost-effectiveness
- Partner directly with Product Managers and Data Scientists to translate business requirements into innovative solutions
- Own the full feature lifecycle, from initial whiteboarding to production deployment and long-term maintenance
Requirements:
- 4+ years of professional data engineering experience with a demonstrated ability to architect and deploy production-grade data platforms from scratch
- Expert-level proficiency in Python and Apache Spark, with specific experience in JVM tuning, memory management, and optimizing execution plans for large-scale distributed workloads
- Deep expertise in modern data architecture, software design patterns, and various data modeling techniques designed for scalability and performance
- Proven track record of building on AWS (primary) or GCP, including hands-on experience with managed services like EMR or Databricks
- Extensive experience designing and managing complex data lifecycles using orchestration tools such as Airflow, AWS Step Functions, or Prefect
- Deep understanding of data cleansing, curation, and transformation strategies, coupled with experience implementing data governance, security, and lifecycle management policies
- Strong background in building reusable libraries, frameworks, and internal tools that standardize data ingestion and automate ETL/ELT workflows
- Exceptional debugging skills for distributed systems and resolving performance bottlenecks at scale
- Proficiency with CI/CD tools and processes (e.g., Codefresh, Jenkins)
- Excellent verbal and written communication skills in English, with the ability to translate complex technical architectures into actionable insights for stakeholders and cross-functional teams
- Must be located in the Eastern (EST) or Central (CST) time zone
- Applicants must have the unrestricted right to work in the United States. Veeva will not provide sponsorship at this time
Nice to Have:
- Relevant certifications (e.g., AWS, Spark, or similar)
- Familiarity with streaming and distributed technologies such as Spark Streaming, EKS, Kinesis, or Apache Kafka
- Experience implementing or managing modern cloud data warehouses or lakehouse architectures
- Prior experience working in the Life Sciences industry