Solved: Optimizing data model update time

Denis_Slav · ‎07-06-2020

Hello,

What would you recommend to reduce model update time?

I have folders with more than 30 GB of CSV data, and each month I add about 2 GB. The last update attempt took about 2 hours; when it takes longer than that, the update process is terminated.

So far, these are the recommendations I have found and applied:

1) I turned off auto-detection of date/time columns;

2) I deleted the columns I don't use.

Or is there a way to increase the time allowed for a model update?

Alexander76877 · ‎07-06-2020

Hi, does it take 2 hours to refresh the model from PBI Desktop or in the service (workspace)? And where do you keep your data: on a local hard drive or server, or in cloud storage? Is it a few big files or many hundreds of smaller files?

If your computer / local bandwidth is the limiting factor, then 2 hours for 30+ GB is not too unrealistic.

As a solution, you could

* put your data on cloud storage (e.g. SharePoint) and

* create a dataflow ingesting your data every day.
The dataflow runs on shared capacity in the cloud, and you can run it during the night so you don't have to wait for it. In PBI Desktop, you then just read from the dataflow, which is really fast.

Alexander

Denis_Slav · ‎07-07-2020

@Alexander76877

All data is stored in the cloud on OneDrive. There is one file per month; at the moment there are 17 files. The refresh time I mentioned is for the service version.

Could you give more information about "create a dataflow ingesting your data every day."? I upload a new file once per week, and right now I'm trying to refresh the whole model, since this is the first upload.

Alexander76877 · ‎07-07-2020

@Denis_Slav

Hi, a dataflow is the online version of Power Query in PBI.

https://docs.microsoft.com/en-us/power-bi/transform-model/service-dataflows-create-use

Instead of refreshing the dataSET reading from files you can refresh the dataFLOW first on a weekly basis (or daily if you like) and then refresh the dataset reading from the dataflow.

I would give it a try, but honestly I don't think it will help much, as you're already using cloud storage and a cloud compute service.

I wonder why the 2h refresh time bothers you in the first place. You have to wait for a week to get the new file so why would waiting for 2h more be a problem?

What you could do, though, is split the query. Instead of importing all historic data with every refresh, create a first, static import (refresh disabled) for e.g. files 1-15 (e.g. 28 GB) and a second, dynamic import for files 16 and later. You need to append both queries before loading into the dataset. This way, the refresh will only read 2 files instead of 17, which should be much faster. Of course, the number of files in the dynamic import will grow over time, so every now and then you'll need to move some files from the dynamic to the static import and refresh the static one.
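As a rough illustration of the split described above, here is a sketch in plain Python rather than Power Query (in Power BI this would be two queries, one with refresh disabled, appended together). The names `read_rows` and `SplitImport` are hypothetical, and the sketch assumes every CSV has a single header row:

```python
import csv


def read_rows(paths):
    """Return the data rows (header skipped) of every CSV file in `paths`."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed: one per file)
            rows.extend(reader)
    return rows


class SplitImport:
    """Hypothetical helper: historic files are read once, recent files on every refresh."""

    def __init__(self, static_paths):
        # "Static import": loaded a single time and never re-read by refresh().
        self.static_rows = read_rows(static_paths)

    def refresh(self, dynamic_paths):
        # "Dynamic import": only these files are re-read, then appended
        # to the cached historic rows.
        return self.static_rows + read_rows(dynamic_paths)
```

Each call to `refresh` touches only the recent files, so its cost depends on the size of the dynamic part rather than on the whole history.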

One thing we haven't touched so far is the complexity of your query. Make sure you're not merging it with other tables in a way that creates a cross product of rows, or anything similar. For trial purposes, try to reduce the complexity of the query to the bare minimum: just read the files without any transformations.

Good luck, keep trying.

Alexander

Denis_Slav · ‎07-08-2020

@Alexander76877, thank you. Dataflows are really cool, but they don't help me. :(

Would it be better if I manually combined the files, for example a whole year into one file?

And which would be better:

1) Filtering the necessary rows in the query?

2) Removing the unnecessary rows from the CSV?

I tried refreshing without the 5 files for 2020, and it took 01:44, so one file takes about 9 minutes. For a full 2 years it may take about 03:40. Not optimistic. 🙂 That's a reason to think about creating something like a buffer database: upload all the data there, and then use it for the reports.

Alexander76877 · ‎07-08-2020

Isn't it strange that 1 file takes 9 min but 5 files take 104 min instead of 45 min? Something's not right.

It would be best to reduce the number of rows and columns in the CSV before ingesting it into the dataflow/dataset. This will reduce the time needed for transfer and transformation upfront. How many rows and columns do your files have today?

As a trial, you could manually reduce the file sizes by removing rows and columns from the CSV. Let me know the new timing.

Alexander

Denis_Slav · ‎07-08-2020

@Alexander76877 It's OK, because the earliest files have fewer rows and are smaller. That was the average time. I suppose individual files may differ by a factor of 2. The files range from 0.9 GB to 2.4 GB.

Now I'm preparing files for 2 tests:

1) Combine every 3 files into one;

2) Reduce the rows in the files.

It takes time 🙂

Denis_Slav · ‎07-08-2020

@Alexander76877

I've got very interesting test results:

#	Type	Total time	Files	Total rows	Avg time per file	Avg time per 1M rows
1	ByMonth	1:44:00	12	75 228 279	0:08:40	0:01:23
2	ByMonth + filter on rows	0:22:01	6	31 751 044	0:03:40	0:00:42
3	ByQuarter + filter on rows	0:23:32	2	31 751 044	0:11:46	0:00:44
4	By7Month + filter on rows	0:24:31	1	38 268 574	0:24:31	0:00:38
5	ByQuarter + delete rows in dataset	0:20:31	2	19 844 404	0:10:15	0:01:02
6	ByMonth + filter on rows	0:58:03	17	101 852 315	0:03:25	0:00:34
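The two average columns can be rechecked from the totals. The sketch below only redoes the arithmetic on the figures in the table; the helper names `to_seconds` and `to_hms` are illustrative, and results are rounded to the nearest second, so a value may differ from the table by one second:

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(seconds):
    """Format a duration in seconds as 'H:MM:SS', rounded to the nearest second."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# (run number, total time, number of files, total rows), copied from the table
runs = [
    (1, "1:44:00", 12, 75_228_279),
    (2, "0:22:01", 6, 31_751_044),
    (3, "0:23:32", 2, 31_751_044),
    (4, "0:24:31", 1, 38_268_574),
    (5, "0:20:31", 2, 19_844_404),
    (6, "0:58:03", 17, 101_852_315),
]

for number, total_time, files, rows in runs:
    total = to_seconds(total_time)
    per_file = to_hms(total / files)
    per_million_rows = to_hms(total / (rows / 1_000_000))
    print(number, per_file, per_million_rows)
```

Looking at time per million rows rather than per file makes the runs comparable, since the files differ a lot in size.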

And after refreshing the dataflow I manually refreshed the data, and it took 00:20:17.

Very interesting statistics. There is no significant difference in speed between the FILTERED and the DELETED files.

But there is one thing I don't clearly understand. If the dataflow was refreshed successfully, why were the model and report not updated? Do I need to refresh the model after the dataflow has been refreshed?

Alexander76877 · ‎07-08-2020

Cool. The dataflow serves as an intermediate data source. Multiple datasets can now consume from the dataflow instead of recreating the import from files. And yes, you need to refresh the dataset now that your dataflow is updated. But see, even reading from a hyperfast single-table Azure source, it takes 20 min to refresh; that's the bare minimum under ideal conditions. If I calculated correctly, that equals ~200 Mbps, on a shared, free resource! Therefore, if you find a 60 min solution reading from text files that need to be processed row by row, I think that should be acceptable. If that solves your problem, please accept this as the solution to close the case.

Alexander
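The ~200 Mbps figure is consistent with moving roughly 30 GB in 20 minutes. The post does not say which data size was used, so the 30 GB below is an assumption taken from the original question, counted in decimal gigabytes; `throughput_mbps` is an illustrative name:

```python
def throughput_mbps(size_gb, minutes):
    """Average throughput in megabits per second for `size_gb` decimal gigabytes."""
    bits = size_gb * 1e9 * 8          # gigabytes -> bits
    return bits / (minutes * 60) / 1e6  # bits per second -> megabits per second


# Assumed inputs: ~30 GB (from the original question) in a 20 minute refresh.
print(throughput_mbps(30, 20))
```

If the refreshed data was smaller than 30 GB, for example after filtering rows, the effective throughput would be proportionally lower.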

Anonymous · ‎07-07-2020

Hi @Denis_Slav,

Check the reference here.

https://powerbi.microsoft.com/en-us/blog/introducing-power-bi-data-prep-wtih-dataflows/

Best Regards,
Kelly


amitchandak · ‎07-06-2020

@Denis_Slav, refer to:

https://www.thebiccountant.com/2016/11/08/speed-powerbi-power-query-design-process/

https://docs.microsoft.com/en-us/power-bi/guidance/power-bi-optimization
