UNFI is a company focused on ensuring the stability and performance of its data platforms. The Data Platform Reliability Engineer will be responsible for monitoring, troubleshooting, and automating workflows for Databricks and AWS services, while also supporting BI tools and maintaining operational reliability.
Responsibilities:
- Monitor health and performance of Databricks clusters, jobs, and workflows
- Maintain observability dashboards, alerts, and logs for AWS services and ingestion pipelines
- Respond to incidents, perform root cause analysis, and implement corrective actions
- Monitor and optimize platform costs across cloud and data services
- Implement and maintain cost controls (cluster policies, auto-termination, right-sizing, job scheduling, storage lifecycle policies) and provide regular cost reporting
- Monitor spend and utilization for Databricks, AWS, ingestion, and BI services
- Promote performance best practices
- Build and maintain dashboards, alerts, and logs for Databricks, AWS services, ingestion pipelines, and BI refreshes
- Continuously tune alert thresholds to reduce noise and improve signal-to-action ratio
- Ensure end-to-end lineage/traceability for faster fault isolation across stages
- Coordinate with external support teams for day-to-day operations and issue resolution
- Coordinate with vendors for troubleshooting, service improvements, and escalations
- Track and report on SLA adherence and vendor performance
- Maintain operational runbooks, knowledge base, and handoff procedures between internal teams and external partners
- Drive automation and efficiency in operational workflows
- Optimize resource utilization and reduce manual intervention
- Support Power BI, Tableau, and Alteryx operations (gateway health, dataset refresh schedules, workspace/app permissions, data-source connectivity)
- Monitor and improve dataset refresh reliability, query performance, and user access hygiene
- Perform other duties as assigned
Requirements:
- Bachelor's degree in computer science, data analytics, systems analysis, or a related field
- 3+ years in data platform operations or reliability engineering, including hands-on experience with Databricks and AWS services in production environments
- Demonstrated success in maintaining high-impact data platforms, with a strong track record of managing complex environments
- Familiarity with ingestion tools (Fivetran, AWS DMS, DataStage, Informatica) and BI platforms (Power BI, Tableau, Alteryx)
- Experience with SAP, master data management, and cross-functional processes across supply chain, finance, and operations
- Strong troubleshooting and incident management skills
- Knowledge of governance, security, and RBAC principles
- Ability to work independently and collaborate with external partners
- Familiarity with Agile practices and DevOps principles, and an understanding of governance, security, and privacy
- Good judgment, as direct supervision may not always be immediately available