eduaguilar
Regular Visitor

What is the best way to handle a 50GB dataset in Power BI?

Hello, I have a big dataset (>20GB) for which I need to create a Power BI report. The underlying data is huge, but the user should only be looking at a small subset of rows at a time by means of a single-select filter. I cannot aggregate the data. The data (simplified for illustration purposes; there are way more columns) is just a list of transactions per client, like this:

 

TransactionID   ClientName
TRAN1           Client 1
TRAN2           Client 1
TRAN3           Client 1
TRAN1           Client 2
TRAN2           Client 2

 

There are thousands of clients, and each client has roughly a hundred thousand transactions. The user can only select one client at a time by means of a single-select dropdown. This means that the data the report actually needs at any moment is quite small (~a hundred thousand rows in the worst case, if the client has a lot of transactions).
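For illustration, the single-select behaviour can be sketched as a DAX measure (a hedged example; the Transactions table and ClientName column come from the sample above, and the measure name is made up):

    // Only return a result when exactly one client is selected in the
    // dropdown; otherwise the visual stays blank instead of scanning
    // all clients.
    Selected Client Transactions =
        IF (
            HASONEVALUE ( Transactions[ClientName] ),
            COUNTROWS ( Transactions )
        )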

 

Is there a way to load the entire dataset at once? Is DirectQuery a solution to this problem? I really don't want to split my dataset into groups of N clients and create multiple reports. Is there any other solution that is not DirectQuery?

 

Thanks.

 

 

1 ACCEPTED SOLUTION
TomMartens
Super User

Hey @eduaguilar ,

 

it's difficult to suggest a solution because, as always, it depends.

 

First, >20GB (from the text) and 50GB (from the title of this thread) are quite different numbers.

Nevertheless, the size of a Power BI dataset cannot be compared with that of a row-based relational database like SQL Server or Oracle, as Power BI stores the data in a column-oriented data store. This storage comes with many different compression types. This article explains the different types of compression: Inside VertiPaq - Compress for success - Data Mozart (data-mozart.com)

Depending on the data distribution (the number of unique values per column), large amounts of relational data can be loaded into a Power BI dataset.
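To see how well your own columns compress, you can inspect per-column storage with a DAX query like this (a sketch, assuming a recent Power BI Desktop that ships the DAX INFO functions; runnable in the DAX query view or DAX Studio):

    // Lists the storage footprint of every column in the model; columns
    // with a large DICTIONARY_SIZE (many unique values) compress poorly.
    EVALUATE
        INFO.STORAGETABLECOLUMNS ()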

Nevertheless, you have to be aware of the dataset size limits: 1 GB for Power BI Pro licensing, 10 GB for Power BI Premium Per Capacity, and 100 GB for Premium Per User. If you are using the large dataset storage format (a property of a Premium capacity), the dataset size can be extended to the available memory of the capacity, up to 400 GB on a P5 capacity. Remember, due to column-oriented compression, 400 GB of RAM can store multiple TB of relational data.

 

Depending on the RAM of the machine you are running Power BI Desktop on, simply try to load all the data, then check the memory consumption of the Microsoft SQL Server Analysis Services subprocess in Task Manager.

 

If it's beyond 600MB, you have to think.

 

Of course, DirectQuery is a solution, as it does not come with a dataset size limit. You have to be aware, though, that a single visual cannot request more than 1,000,000 (1M) rows per query.

 

Depending on your licensing (your budget), you can also consider a hybrid approach, a mixture of imported data and DirectQuery, using aggregations: User-defined aggregations - Power BI | Microsoft Docs
Using aggregations helps to combine fast query performance (imported data) with large amounts of data (DirectQuery).
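To make the idea concrete: the aggregation table holds pre-grouped rows at a coarser grain than the DirectQuery fact table. That grain could look like this DAX query (a sketch; table and column names come from the question, and in practice the aggregation table is built in the source or in Power Query, not as a calculated table):

    // One row per client with its transaction count: the shape of data an
    // imported aggregation table answers from memory, while the detail
    // transactions stay in DirectQuery.
    EVALUATE
        SUMMARIZECOLUMNS (
            Transactions[ClientName],
            "Transaction Count", COUNTROWS ( Transactions )
        )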

If you consider DirectQuery, you also have to keep the underlying data source in mind: each visual on a Power BI report page creates a query, so many visuals and many users of your Power BI report create traffic/stress on the data source.

 

Hopefully, this provides some ideas and helps to tackle your challenge.

 

Regards,

Tom 





5 REPLIES
TomMartens
Super User

Hey @eduaguilar ,

 

600MB is close, but not too close, to the file size upload limit if your licensing is based on Power BI Pro.

A pbix cannot be larger than 1GB if you are uploading to Pro.

A pbix cannot be larger than 3GB (P1), up to 10GB (P5), if you are on Premium Per Capacity or Premium Per User.

What is Microsoft Power BI Premium? - Power BI | Microsoft Docs

 

If you are on Premium, you have plenty of time (depending on the speed of dataset growth) to consider incremental load, but even on Premium you have to start thinking now about future data growth.

 

Even on Premium Per Capacity, the initial load will become difficult if it exceeds the file size limit (which depends on the P[X] SKU).

If you are on Pro licensing, you have to consider DirectQuery / aggregations / shrinking the dataset size now.
This article describes some techniques for shrinking your dataset size: How to reduce your Power BI model size by 90%! - Data Mozart (data-mozart.com)

Please be aware that incremental refresh does not help you get around the dataset size limit.

 

Hopefully, this explains why I consider 600MB a magic number 🙂

 

Regards,

Tom




Thanks Tom, great insights. I am definitely looking into reducing column cardinality, since I can clearly do better in that department.

 

I do have one question, though. Initially you said that I should check "the memory consumption of the Microsoft SQL Server Analysis Services subprocess in Task Manager", and you also said that "600MB is close, but not too close, to the file size upload limit if your licensing is based on Power BI Pro".

 

This makes me think that the "actual" data size of the model is the memory consumption of this subprocess, not the file size of the PBIX. However, when I upload to the Power BI service, the space I occupy is the actual file size, not the memory consumption of this process. For example, I have a PBIX that is ~100MB in file size but consumes ~600MB of RAM. So, for limit purposes, which of the two numbers matters? One is telling me that I can only grow another ~400MB, while the other is telling me I can grow another ~900MB, which seems more than enough.

 

Thanks.

Hey @eduaguilar ,

 

consider the size that the Analysis Services process is consuming when the model is loaded.
The pbix file size (at rest) does not account for calculated columns and some other minor artifacts that are expanded when the model is loaded into memory.

One aspect you might also want to consider for reducing the dataset size is converting "pre-computed" columns, like price * quantity = sales, into measures. Columns like sales can create huge numbers of distinct values; due to the nature of the Tabular engine, it's likely that your users will not "feel" a performance degradation.
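A minimal sketch of that conversion in DAX (Price and Quantity are assumed column names; the stored sales column would be removed from the model):

    // Compute sales at query time instead of storing a high-cardinality
    // sales column in the model.
    Sales Amount =
        SUMX ( Transactions, Transactions[Price] * Transactions[Quantity] )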

Depending on the number of rows, the above-mentioned optimization trades query performance for memory consumption.

The other day I was helping a team reduce their dataset size (initial memory consumption when loaded) from ~8GB to ~1.5GB without degrading query performance so much that users start to complain. But this was a time-consuming team effort; start by transforming high-cardinality columns and removing unneeded columns, like row identifiers, from large fact tables.
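A quick way to spot the worst offenders is to check the distinct counts of suspect columns, for example (a sketch using the TransactionID column from the question):

    // A distinct count close to the row count signals a column that
    // dictionary encoding cannot compress well.
    EVALUATE
        ROW (
            "TransactionID cardinality", DISTINCTCOUNT ( Transactions[TransactionID] ),
            "Row count", COUNTROWS ( Transactions )
        )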

Regards,
Tom




Hi Tom, great suggestions. I did load the model, and the RAM usage is above 600MB. Why is this the "magic number" at which I should start looking for better solutions?
