Senior Data Engineer - Python & PySpark
Job Summary
We are seeking an experienced Senior Data Engineer with strong expertise in Python, PySpark, SQL, and Big Data technologies.
The ideal candidate will be responsible for designing, developing, and optimizing scalable data pipelines and ETL/ELT workflows for processing large volumes of structured and unstructured data. The role requires hands-on experience with distributed data processing, cloud platforms, orchestration tools, and performance optimization of big data applications.
Key Responsibilities
Data Pipeline Development
- Design, develop, and maintain scalable data pipelines using Python and Apache Spark (PySpark).
- Build reusable and efficient data processing frameworks.
ETL / ELT Development
- Develop and optimize ETL/ELT workflows for data ingestion, transformation, and processing.
- Process large volumes of structured and unstructured data.
Big Data Processing
- Work with big data technologies such as the Hadoop ecosystem, Hive, and Spark.
- Implement distributed computing solutions for high-performance processing.
Data Modeling & Warehousing
- Support data modeling, data architecture, and data warehousing solutions.
- Ensure scalability and maintainability of data systems.
SQL & Database Management
- Write and optimize complex SQL queries and data transformation logic.
- Work with both relational and non-relational databases.
Cloud & Orchestration
- Deploy and manage data solutions on cloud platforms such as AWS, Azure, and Google Cloud Platform.
- Work with orchestration tools such as Apache Airflow.
Data Quality & Governance
- Perform data validation, cleansing, and transformation.
- Ensure compliance with data governance and security standards.
Performance Optimization
- Optimize Spark jobs, SQL queries, and data pipelines.
- Improve scalability, reliability, and processing performance.
Collaboration & Agile Delivery
- Collaborate with Data Analysts, Data Scientists, DevOps teams, and business stakeholders.
- Participate in Agile ceremonies, sprint planning, and continuous improvement initiatives.
Required Skills
Programming & Data Engineering
- Python
- PySpark
- Apache Spark
- SQL
Big Data Technologies
- Hadoop ecosystem
- Hive
- Distributed computing platforms
ETL / ELT & Orchestration
- ETL / ELT pipelines
- Apache Airflow or similar orchestration tools
Cloud Platforms
- AWS / Azure / Google Cloud Platform
- Cloud-based data services
Databases & Data Warehousing
- Relational databases
- NoSQL databases
- Data warehousing concepts
- Data modeling
File Formats
Soft Skills
- Strong analytical and troubleshooting skills
- Excellent communication and collaboration abilities
- Ability to work with cross-functional teams
Experience Required
- 6-10+ years of experience in Data Engineering, Big Data technologies, and distributed data processing.
Preferred Skills
- Performance tuning and optimization expertise
- Experience with scalable cloud-native data architectures
- Exposure to DevOps and CI/CD for data platforms