AGo (Member)

Large amount of data refresh best practice

Hi everybody,

 

I have a Pro license. I have a folder collecting CSVs from the past 12 months up to today; one month is approximately 1.5 GB. I could use the Folder source directly, but I'm going through a dataflow. The refresh is very slow, in the coming months I may exceed the 10 GB limit, and I'm wasting both Microsoft's bandwidth and my own in a way that makes no sense, because this data (production machine logs) lends itself to day-by-day incremental refresh. Premium capacity isn't worth considering given how small the project is and that this is the only thing I would need it for.
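
For context, a minimal sketch of the local Folder-source pattern described above, assuming comma-delimited UTF-8 CSVs with identical headers; the folder path is a placeholder:

let
    // Placeholder path: point this at the folder that collects the daily CSV logs
    Source = Folder.Files("C:\Logs\Production"),
    // Keep only .csv files in case the folder contains other artifacts
    CsvFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
    // Parse each file; assumes comma-delimited, UTF-8 files with identical headers
    Parsed = Table.AddColumn(CsvFiles, "Data",
        each Table.PromoteHeaders(Csv.Document([Content], [Delimiter = ",", Encoding = 65001]))),
    // Stack every file into one table
    Combined = Table.Combine(Parsed[Data])
in
    Combined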

Does anyone have a best practice for extracting and transferring CSV data in a compressed form? Do I need to write a program that loads these CSVs into a database, so that it acts as a more compact data source?

Any advice?

 

Thanks very much in advance

1 ACCEPTED SOLUTION

AGo (Member)

Re: Large amount of data refresh best practice

Thanks to your reply, I realized I could upload the data to OneDrive/SharePoint and process it directly in the cloud. I managed to open multiple CSVs across multiple subfolders using the SharePoint folder connection method. The refresh now takes half an hour to complete; as the data grows that may become unsustainable and I'll have to move to a data lake approach, but for now it works for this test phase. Development is done with a parameter that limits the number of rows to 1M, to avoid bandwidth saturation and crashes on Desktop, and the limit is removed once published online. Thanks again.
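
For reference, a minimal sketch of the pattern described above, assuming the files live on a SharePoint site (hypothetical URL), are comma-delimited UTF-8 with identical headers, and that a numeric parameter named RowLimit has been defined in the model:

let
    // Hypothetical site URL; SharePoint.Files lists files from every subfolder of the site
    Source = SharePoint.Files("https://contoso.sharepoint.com/sites/MachineLogs", [ApiVersion = 15]),
    CsvFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
    // Assumes comma-delimited, UTF-8 files with identical headers
    Parsed = Table.AddColumn(CsvFiles, "Data",
        each Table.PromoteHeaders(Csv.Document([Content], [Delimiter = ",", Encoding = 65001]))),
    Combined = Table.Combine(Parsed[Data]),
    // RowLimit is assumed to be a numeric parameter defined in the model (e.g. 1000000 on Desktop);
    // set it to null before publishing so the Service loads everything
    Limited = if RowLimit = null then Combined else Table.FirstN(Combined, RowLimit)
in
    Limited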


7 REPLIES
Super User

Re: Large amount of data refresh best practice

@AGo You could spin up an Azure Analysis Services instance and build the model there more cheaply. Then you could use a live connection from Power BI to that model without incurring the larger overhead. This would also let you scale at a lower price point.



Re: Large amount of data refresh best practice

I wonder if you could ZIP them and read them into a dataflow from a OneDrive/SharePoint folder:

https://community.powerbi.com/t5/Desktop/PowerQuery-read-from-folder-with-many-zip-file/td-p/57334

https://github.com/hohlick/Power-Query-Excel-Formats/blob/master/UnZip.pq

 

Not sure about performance on files of that size, especially with a custom M function.
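
In case it helps, a rough sketch of how that might be wired up, assuming the UnZip.pq function from the second link is saved as a query named UnZip and that it returns one row per archived file with FileName/Content columns (both of those are assumptions, as are the site URL and CSV format):

let
    // Hypothetical OneDrive/SharePoint site holding the zipped CSV batches
    Source = SharePoint.Files("https://contoso.sharepoint.com/sites/MachineLogs", [ApiVersion = 15]),
    ZipFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".zip"),
    // UnZip is assumed to be the linked custom function, saved as a query named UnZip;
    // the column names expanded below are assumptions, so check its actual output first
    Unzipped = Table.AddColumn(ZipFiles, "Archive", each UnZip([Content])),
    Expanded = Table.ExpandTableColumn(Unzipped, "Archive", {"FileName", "Content"}, {"InnerName", "InnerContent"}),
    // Parse each extracted CSV; assumes comma-delimited, UTF-8 files with identical headers
    Parsed = Table.AddColumn(Expanded, "Data",
        each Table.PromoteHeaders(Csv.Document([InnerContent], [Delimiter = ",", Encoding = 65001]))),
    Combined = Table.Combine(Parsed[Data])
in
    Combined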

AGo (Member)

Re: Large amount of data refresh best practice

I've read about the ZIP approach, but it would be my last-choice workaround, because the CSVs arriving every day would have to be automatically zipped into a multi-level hierarchical path inside a single zip file.
But the 10 GB size limit refers to the .pbix, which is compressed, so maybe there's no problem with the 10 TB limit on my SharePoint/OneDrive. I'll test it, because these CSVs currently live on a Linux machine and I have to set up a sync to the cloud drive.
If it works, I hope there's a way to open multiple CSV files on SharePoint/OneDrive like the Folder connection does.
If only incremental refresh were available for Pro users...

Re: Large amount of data refresh best practice

The 10 GB limit refers to the in-memory, uncompressed size...and that's not including any memory overhead that's needed.

 

I've struggled to get .pbix files larger than 3 GB to publish to Premium.

AGo (Member)

Re: Large amount of data refresh best practice

Could you explain your first paragraph, please?
Anyway, I wrote "compressed" because a .pbix, if renamed to .zip, is a compressed container.

Thanks

Re: Large amount of data refresh best practice

I was thinking of the published data model size limitation...I see now that you probably meant the workspace storage limitation.
