Tebra is a company focused on modernizing healthcare for independent practices. They are seeking a Senior Data Engineer who will architect and operate data infrastructure to support AI/ML initiatives, transforming healthcare data into high-quality training sets and real-time features.
Responsibilities:
- Architect and write software that solves complex business problems, specifically designing scalable pipelines for feature extraction, training data generation, and model monitoring logs
- Own and serve as a Subject Matter Expert (SME) for large software systems, such as the organization's Feature Store or Data Lakehouse, ensuring data availability for both experimentation and production inference
- Continuously monitor data pipelines in production, detect data drift or quality anomalies, and implement automated recovery systems to ensure the reliability and freshness of features and training data over time
- Lead Engineering Design Reviews, providing well-articulated and reasoned explanations for architecture decisions (e.g., choosing between batch processing for training vs. real-time streaming for inference)
- Write software frameworks that can be extended by others on the team, such as automated data quality checks and schema validation tools that prevent training-serving skew
- Translate business requirements into software solutions, bridging the gap between raw data sources and the structured inputs needed for advanced ML models
- Know when and how to optimize complex code, specifically tuning Spark jobs or SQL queries to handle the massive datasets required for Large Language Model (LLM) fine-tuning or deep learning
- Collaborate cross-functionally, including with ML engineers, to implement MLOps best practices such as data versioning, lineage tracking, and reproducibility
- Scope tasks expertly, breaking down complex data infrastructure initiatives into manageable deliverables for the squad
Requirements:
- 5+ years of professional software development experience
- Deep technical subject matter expertise in 3+ general areas of software development (e.g., Big Data Processing, Distributed Systems, Data Modeling)
- 3+ years of hands-on experience in Data Engineering with a focus on supporting analytics or data science teams
- Advanced proficiency in Python and SQL. You are comfortable writing production-grade code for data transformation and orchestration (not just scripts)
- Proven ability to architect and write software that enables ML at scale—moving beyond simple ETL to building robust data platforms
- Strong background in modern data infrastructure relevant to AI (e.g., Spark, Airflow, Kafka, Vector Databases)
- Experience with Data Lake/Lakehouse architectures (e.g., Databricks, Snowflake, Delta Lake) and understanding how to structure data for efficient model training
- Familiarity with MLOps concepts: You understand the difference between a training set and a test set, and you know what 'data leakage' is and how to prevent it in the pipeline
- Proven ability to deploy and maintain data systems in production with CI/CD, monitoring, and alerting
- Excellent technical communication and a product mindset—comfortable driving initiatives from concept to delivery
- Background in healthcare software operations or working with structured business data
- Experience implementing or managing a Feature Store (e.g., Feast, Tecton)
- Familiarity with data version control tools (e.g., DVC, lakeFS)
- Published research or conference papers in data engineering, distributed systems, or machine learning
- Experience with retrieval-augmented generation (RAG) pipelines or vector search infrastructure
- Contributions to open-source data or ML infrastructure projects