Denis_Slav
Helper III

Optimizing data model update time

Hello, 

 

What kind of recommendations could help reduce the model update time?

 

I've got folders with > 30 GB of CSV data, and each month I add another ~2 GB. My last refresh attempt took about 2 hours; when it takes longer than that, the update process gets terminated.

 

So far, these are the recommendations I have found and applied:

1) I turned off automatic date/time column detection;

2) Deleted the columns I don't use.

 

Or is there a way to increase the time limit for the model refresh?

Alexander76877
Helper II

Hi, does it take 2 hours to refresh the model from PBI Desktop or in the service (workspace)? And where do you keep your data? Local hard drive / server or cloud storage? Is it a few big files or many hundreds of smaller files?

If your computer / local bandwidth is the limiting factor, then 2 hours for 30+ GB is not too unrealistic.

As a solution, you could

* put your data on cloud storage (e.g. sharepoint) and

* create a dataflow ingesting your data every day.
The dataflow runs on a shared capacity in the cloud, and you can run it during the night so you don't have to wait for it. With PBI Desktop, you then just read from the dataflow, which is really fast.
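To make that concrete, here is a minimal sketch of what the dataflow's query could look like, assuming the CSVs sit in a SharePoint/OneDrive folder; the site URL, file filter and delimiter are placeholders you would adapt:

let
    // Placeholder site URL - point this at the SharePoint/OneDrive location holding the CSVs
    Source = SharePoint.Files("https://contoso.sharepoint.com/sites/Reports", [ApiVersion = 15]),
    CsvOnly = Table.SelectRows(Source, each Text.EndsWith([Name], ".csv")),
    // Parse every file and promote its header row
    Parsed = Table.AddColumn(CsvOnly, "Data",
        each Table.PromoteHeaders(Csv.Document([Content], [Delimiter = ",", Encoding = 65001]))),
    // Stack all monthly files into one table
    Combined = Table.Combine(Parsed[Data])
in
    Combined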

 

Alexander

@Alexander76877 

All data is stored in the cloud on OneDrive. There is one file per month; at the moment it's 17 files. The refresh times I mentioned are through the service version.

 

Could you give more information about "create a dataflow ingesting your data every day"? I upload a new file once per week, and right now I'm trying to refresh the whole model as a first-time load.

@Denis_Slav

 

Hi, a dataflow is the online version of PBI PowerQuery.

https://docs.microsoft.com/en-us/power-bi/transform-model/service-dataflows-create-use

Instead of refreshing the dataSET reading from files you can refresh the dataFLOW first on a weekly basis (or daily if you like) and then refresh the dataset reading from the dataflow.
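In Power BI Desktop, the query that reads from the dataflow then looks roughly like this (the Dataflows connector generates the navigation steps for you, usually with IDs; the workspace, dataflow and entity names below are placeholders):

let
    // Connect to the dataflows the signed-in account can see
    Source = PowerBI.Dataflows(null),
    // Navigate workspace -> dataflow -> entity; placeholder names, the connector UI builds these steps
    Workspace = Source{[workspaceName = "Reporting"]}[Data],
    Dataflow = Workspace{[dataflowName = "MonthlyCsvIngest"]}[Data],
    Entity = Dataflow{[entity = "Sales"]}[Data]
in
    Entity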

I would give it a try, but honestly I don't think it will help too much, as you're already using cloud storage and a compute service.

I wonder why the 2h refresh time bothers you in the first place. You have to wait for a week to get the new file so why would waiting for 2h more be a problem?

 

What you could do though is to split the query. Instead of importing all historic data with every refresh, create a first static import (with refresh disabled) for e.g. files 1-15 (≈28 GB) and a second dynamic import for files 16+. You need to append both queries before loading into the dataset. This way, the refresh will only read 2 files instead of 17, which should be much faster. Of course, the 2 files will become more over time, so you'll need to move some files from the dynamic to the static import now and then and refresh the static one.
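A sketch of that split, assuming two folder queries named HistoricStatic (refresh disabled, files 1-15) and CurrentDynamic (files 16+) already exist - the append itself is a one-liner:

// Query "SalesCombined": load this one into the dataset,
// keep HistoricStatic and CurrentDynamic as connection-only queries
let
    Source = Table.Combine({HistoricStatic, CurrentDynamic})
in
    Source

The two source queries should produce the same columns so the append lines up cleanly.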

 

One thing we didn't touch on so far is the complexity of your query. Make sure you're not blowing it up (i.e. merging with other tables and creating a Cartesian product) or anything like that. For trial purposes, try to reduce the query to the bare minimum: just read the files without any transformations.
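For that trial, the bare-minimum query could be as small as this (the local path and delimiter are placeholders):

let
    // Read one CSV with no transformations beyond promoting the header row
    Source = Csv.Document(File.Contents("C:\data\2021-01.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source)
in
    Promoted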

 

Good luck, keep trying.

 

Alexander

Alexander, thank you. The dataflow is really cool, but it didn't help me. :(

 

Would it maybe be better if I manually combined the files, for example a whole year into one file?

 

And which would be better:

1) Filtering the necessary rows in the query?

2) Removing the unnecessary rows from the CSV itself?

 

I tried to refresh without the 5 files for 2020, and it took 01:44, so one file takes about 9 min. For the full 2 years it may take about 03:40. Not optimistic. 🙂 That's the reason to think about creating a kind of buffer DB to upload all the data into, and then use that for the reports.

Isn´t it strange that 1 file takes 9min but 5 files take 104 instead of 45min? Something´s not right.

It would be best to reduce the number of rows and columns in the CSV before ingesting into the dataflow / dataset. This will reduce the amount of time for transfer & transformation upfront. How many rows / columns do your files have today?

As a trial, you could manually reduce the files size by removing rows / columns from the CSV. Let me know the new timing.
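If editing the CSVs by hand is too much work, the same reduction can be expressed as the very first steps of the query, so the unneeded rows and columns are dropped before anything else happens (the column names and the cut-off date are placeholders):

let
    Source = Csv.Document(File.Contents("C:\data\2021-01.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source),
    // Keep only the columns the report actually uses
    KeptColumns = Table.SelectColumns(Promoted, {"Date", "ProductId", "Qty", "Amount"}),
    // Type the date column, then filter rows as early as possible
    Typed = Table.TransformColumnTypes(KeptColumns, {{"Date", type date}}),
    KeptRows = Table.SelectRows(Typed, each [Date] >= #date(2020, 1, 1))
in
    KeptRows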

Alexander

 

@Alexander76877 No, it's OK, because the earliest files have fewer rows and are smaller; 9 min is an average time, and I suppose it can differ by a factor of two. The files range from 0.9 GB to 2.4 GB.

 

Now I'm preparing files for 2 tests:

1) Combine every 3 files into one;

2) Reduce the rows in the files.

 

It takes time 🙂

@Alexander76877 

I've got very interesting results from the testing:

#  | Type                                 | Total time | Files | Total rows  | Avg time per 1 file | Avg time per 1M rows
1  | By month                             | 1:44:00    | 12    | 75 228 279  | 0:08:40             | 0:01:23
2  | By month + filter on rows            | 0:22:01    | 6     | 31 751 044  | 0:03:40             | 0:00:42
3  | By quarter + filter on rows          | 0:23:32    | 2     | 31 751 044  | 0:11:46             | 0:00:44
4  | By 7 months + filter on rows         | 0:24:31    | 1     | 38 268 574  | 0:24:31             | 0:00:38
5  | By quarter + delete rows in dataset  | 0:20:31    | 2     | 19 844 404  | 0:10:15             | 0:01:02
6  | By month + filter on rows            | 0:58:03    | 17    | 101 852 315 | 0:03:25             | 0:00:34

And after refreshing the dataflow I manually refreshed the dataset, and it took 00:20:17.

Very interesting statistics. There is no significant difference in speed between FILTERED and DELETED rows.

But I don't clearly understand: if the dataflow was successfully refreshed, why weren't the model and report updated? Do I need to refresh the model after the dataflow has been refreshed?

 

Cool. The dataflow serves as an intermediate data source. Multiple datasets can now consume from the dataflow instead of recreating the import from files. And yes, you need to refresh the dataset now that your dataflow is updated. But see, even reading from a hyperfast single-table Azure source it takes 20 min to refresh; that's the bare minimum under ideal conditions. If I calculated correctly (roughly 30 GB in 20 minutes ≈ 25 MB/s), that equals ~200 Mbit/s on a shared, free resource! Therefore, if you find a 60-minute solution reading from text files that need to be processed row by row, I think that should be acceptable. If that solves your problem, please accept this as the solution to close the case.

Alexander

Anonymous
Not applicable

Hi  @Denis_Slav ,

 

Check the reference here.

 

https://powerbi.microsoft.com/en-us/blog/introducing-power-bi-data-prep-wtih-dataflows/

 

Best Regards,
Kelly
Did I answer your question? Mark my post as a solution!
