CVS Health is dedicated to shaping a more connected and compassionate health experience. They are seeking a Principal Data Engineer to develop large-scale data structures and pipelines, collaborate with the data science team, and ensure data quality and accessibility standards. This role involves building data models, integrating data from various sources, and analyzing IT environments to recommend solutions.
Responsibilities:
- Develop large-scale data structures and pipelines to organize, collect, and standardize data that helps generate insights and addresses reporting needs
- Collaborate with the data science team to transform data and integrate algorithms and models into automated processes
- Build data marts and data models to support Data Science and other internal customers
- Integrate data from a variety of sources, ensuring that they adhere to data quality and accessibility standards
- Analyze current information technology environments to identify and assess critical capabilities and recommend solutions
- Build high-performance data processing frameworks leveraging cloud and/or on-premises data platforms
- Design and build large-scale data structures, pipelines, and efficient Extract/Transform/Load (ETL) workflows
- Write ETL (Extract / Transform / Load) processes, design database systems, and develop tools for real-time and offline analytic processing
- Transform data and integrate algorithms and models into automated processes
- Analyze and synthesize data to meet the insights, reporting dashboard, and descriptive/predictive/prescriptive analytic requirements
- Design conformed, aggregated, and semantic data layers, and manipulate large datasets to support insights and analytics using SQL, BTEQ, SAS, and similar tools, as applicable
- Manage data while building data layers in a sandbox or production environment for reporting and analytical use cases
- Work on 'big data' platforms, including Hadoop (Azure or GCP preferred) and Spark, as applicable
- Design data models and solutions for analytical and reporting use cases
- Apply knowledge of Hadoop architecture and HDFS commands, along with experience designing and optimizing queries, as applicable, to build data pipelines
- Use strong programming skills in Python, Java, and/or any of the major languages to build robust data pipelines and dynamic systems
- Experiment with available software tools and advise on new tools to determine the optimal solution given the requirements dictated by the model or use case
- Support modeling/diagramming and build design specifications for data objects and surrounding data processing logic
- Collaborate with business solution strategists and support the onboarding of new data sources through data discovery, profiling, and mapping
- Participate in proofs of concept to build data layers and derive analytical insights
- Leverage multiple tools and programming languages to analyze and manipulate data sets from disparate data sources
Requirements:
- Master's degree (or foreign equivalent) in Computer Science, Computer Information Systems, Data Science, Statistics, Mathematics, Analytics, or a related field
- Two (2) years of experience in the job offered or a related occupation
- Two (2) years of experience with cloud migration technologies: Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP)
- Two (2) years of experience with messaging platforms: Kafka
- Two (2) years of experience with a containerization runtime platform
- Two (2) years of experience in solution architecture, design, and end-to-end delivery of projects
- Two (2) years of experience in domain support for a healthcare or retail organization
- Two (2) years of experience building Proofs of Value (PoVs) and Minimum Viable Products (MVPs) using AI: Generative AI, AutoML, or Virtual AI Databases
- Two (2) years of experience providing guidance on Large Language Model (LLM) selection and the use of MVPs
- Two (2) years of experience conducting Data Quality assessments and defining Data Governance processes through Data Quality and MLOps
- Two (2) years of experience establishing data architectures and best practices