Retail ETL Pipeline
Python + Azure Data Factory
The Challenge
A retail chain with 20+ stores was losing revenue due to stockouts and overstock. Their POS systems generated thousands of transactions daily, but the data lived in isolated on-premise databases with no integration layer — making demand forecasting impossible.
Solution
Built a cloud-native ETL pipeline on Azure to unify all data sources and enable predictive inventory management.
Ingestion Layer
- Connected to 20+ store POS systems using Azure Data Factory pipelines with REST and ODBC connectors
- Implemented incremental load patterns using watermark timestamps — processing only new/changed records per run
- Handled schema drift and malformed records with built-in ADF data flows
Transformation Layer
- Built Python-based transformation scripts running on Azure Databricks (PySpark)
- Applied business rules: category normalization, duplicate removal, currency conversion, and outlier flagging
- Generated aggregated metrics: daily/weekly sell-through rates, days-of-supply, ABC classification
Storage & Serving Layer
- Loaded clean data into Azure Synapse Analytics (formerly SQL DW) partitioned by store and date
- Created Synapse SQL views consumed directly by Power BI for inventory and replenishment dashboards
Results
- 2M+ records processed daily with a pipeline runtime under 25 minutes
- 18% reduction in stockout events in the first quarter after deployment
- ~$120K estimated savings per year in excess inventory carrying costs
- Achieved 99.2% pipeline uptime over 6 months
Stack
Python · PySpark · Azure Data Factory · Azure Databricks · Azure Synapse Analytics · SQL · Power BI