Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
I have a situation where I need to load a 60000 csv file into PowerQuery for modelling into a star schema. We don't have Premium so I cannot chain dataflows. Loading the file into a dataflow, and using this as a source for a dataset is causing that dataset to refresh MUCH slower than simply loading the csv directly into a dataset and refreshing that dataset. Does anyone know why?
Full background:
Plan a: My initial approach was to load the csv file into a dataflow and perform initial cleansing and transformations on it. Then, within the dataset, use the dataflow as a source and perform the steps required to create the dimensions and facts. However, whilst this approach worked when using a sample extract of 100 rows, it didn't scale. The dataflow refreshed ok with 60k rows, but the dataset didn't - it would always time out. For 100 rows, it took 90 seconds. For 500 rows, it would take 350-450 seconds. It would not complete for all 60000 rows.
Plan b: I then created a different dataset that used the csv file as its source. The dataset then performed all the steps the dataflow did, plus all the dim/fact creation that the dataset in Plan a did - so the dataset in plan b is doing more work than in plan a. For 100 rows, it took 5 seconds. For 500 rows, it took 13 seconds. It was able to handle all 60000 rows in just under 10 minutes.
So, I'll clearly be going with plan b! But I don't understand why the dataset was so slow in plan a (when it was using the dataflow source that already had transformations done to it). Any ideas would be most welcome!
Hi @m13eam ,
Complex or large dataflows can take more time to refresh, as can poorly optimized dataflows. The following sections provide guidance on how to mitigate long refresh durations.
Especially with a lot of transformations because using shared capacity it is sharing the memory and CPU.
For more information, you can refer these documentations:
Understanding and optimizing dataflows refresh - Power BI | Microsoft Learn
Hope these help you.
Best Regards,
Community Support Team _Yinliw
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Thanks but the issue I'm having is not with the dataflow. My issue is with the dataset. In otherwise identical datasets:
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.