sevenhills
Super User

DataFlow and Dataset size

Hi 

 

I started creating a bunch of dataflows, then built datasets on top of them, and later paginated reports. It was working smoothly. Now we have added more years of data and each dataset has grown to 400 MB+.

 

How do we find out a dataflow's size? (From searching this forum, it appears that is not possible.)

 

How do we modify the code and control the dataset size during development? It is becoming hard to debug or to add more tables. Any other suggestions for faster development?

 

TIA

10 REPLIES
lbendlin
Super User

400 MB seems small.  What is your constraint?

 

To control the development dataset size you can use a fake incremental refresh: use RangeStart and RangeEnd parameters to limit the data volume in dev, but in Power BI specify that the storage range is the same as the refresh range.
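A minimal Power Query M sketch of that pattern, assuming a SQL source with a datetime column called OrderDate (the server, table and column names here are placeholders):

let
    // RangeStart and RangeEnd must exist as DateTime parameters in the .pbix.
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // In dev, keep RangeStart/RangeEnd to a narrow window (e.g. one month)
    // so the local model stays small; widen the window for the published copy.
    Filtered = Table.SelectRows(FactSales, each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    Filtered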

 

Or use Deployment Pipelines to point at different (sub) data sources.
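One possible way to wire that up is to drive the connection through query parameters and let deployment pipeline parameter rules override them per stage (all names below are invented):

let
    // ServerName and DatabaseName are text parameters; deployment pipeline
    // parameter rules can assign different values in Dev, UAT and Prod,
    // e.g. pointing Dev at a smaller copy of the data.
    Source = Sql.Database(ServerName, DatabaseName),
    DimCustomer = Source{[Schema = "dbo", Item = "DimCustomer"]}[Data]
in
    DimCustomer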

There are two parts to my question:

 

a) I would like to know whether there is any way to see a dataflow's size, and the size of each table/query inside the dataflow. I use deployment pipelines between environments, so I would like to compare the UAT and Prod environments by size as well as by code (.json file). FYI, UAT and Prod both point to the same data sources.

 

b) Let me clarify further: size is not an issue in the Power BI service. I am aware that it can go up to 10 GB.

 

My issue is more about editing the dataset (.pbix file); it is taking too much time, and working remotely over VPN makes it even slower. To make changes I am downloading, editing and then publishing.

Thanks for replying.

a) You can get that information from the dataflow refresh history. Click on that little download arrow and then look inside the downloaded CSV file to see the partition sizes.

 

(screenshot: dataflow refresh history with the CSV download arrow)
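If you want to total those figures across many dataflows, a rough Power Query sketch like the one below can read the exported CSV (the file path is made up, and the column names are assumed to match the export shown further down this thread):

let
    // Load the refresh-history CSV downloaded from the dataflow.
    Source = Csv.Document(File.Contents("C:\Temp\DataflowRefreshHistory.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    Typed = Table.TransformColumnTypes(Promoted, {{"Bytes processed (KB)", Int64.Type}, {"Max commit (KB)", Int64.Type}}),
    // Sum the per-partition figures to get a rough total for the dataflow.
    TotalBytesProcessedKB = List.Sum(Typed[#"Bytes processed (KB)"])
in
    TotalBytesProcessedKB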

 

b) My proposal to use the RangeStart and RangeEnd parameters seems to fit your requirement. As you likely know, any structural update will re-trigger a full refresh anyway.

 

a) I looked at it before posting, but it does not give the storage size, unless Max commit is the storage size.

 

For example, here are the relevant columns:

Rows processed | Bytes processed (KB) | Max commit (KB)
4391838        | 1639203              | 156016
1806           | 558                  | 60
4390648        | 1676142              | 122948
4379521        | 1676142              | 157216

 

b) Thanks for elaborating. I am aware of it; let me rethink this approach.

 

Meanwhile, from searching online, it seems they are considering providing a DirectQuery mode for Power BI dataflows consumed in datasets; until then, there are no other options.

 

Thanks

Remember the values are shown in KB. Multiply Bytes processed by 1024 to get your partition sizes in bytes.
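For example, taking the first partition in the table above: 1,639,203 KB × 1,024 ≈ 1,678,543,872 bytes, i.e. roughly 1.68 GB uncompressed.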

 

Direct Query against Dataflows is a travesty. All it does is put an Azure SQL database between your blob storage and your dataset.

a) So you are saying Max commit (KB) is the storage size of a partition, and adding up the values from each row will give the whole dataflow size.

 

I know this is not 100% accurate, but if it is true it will help me work out storage sizes for all dataflows, as we have many dataflows.

 

The idea to view dataflow sizes has already been submitted: https://ideas.powerbi.com/ideas/idea/?ideaid=5b955cf8-38ac-4864-ac3a-4993cad1b2d6

 

Thank you

 

b) I reached out to my team; currently they want all data in all environments, as the project is young. 😞 I will look for other options.

 

Thank you

"So you are saying Max commit (KB) is the storage of partition size."

 

No. Bytes Processed is the uncompressed dataflow partition size.

 

" it helps for me to do storage allocation for all reports as we have many reports."

 

This has nothing to do with reports.

I am confused, as I started the thread to get storage sizes.

 

Sorry, I meant dataflows, and I have updated my previous reply.

" it helps for me to do storage sizes for all dataflows as we have many dataflows."

My opinion is that the storage size of the dataflow partition is represented by the "Bytes processed" column, not by the "Max commit"  column.

I doubt it, because we can process a lot of bytes while only the needed data is stored, and in compressed format.

 

This article goes into detail: https://docs.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-understand-optimize-re... ("... To cache the entity, Power BI writes it to storage and to SQL.")

 

There is no easy way to determine it.

For now, let me use "sum of Bytes processed across all partitions × 90%" for size estimates, since we don't have a straightforward approach yet.
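As a rough worked example with the figures above, the first partition would come out at 1,639,203 KB × 0.9 ≈ 1,475,283 KB (about 1.5 GB); summing that over every partition gives the estimate for the whole dataflow.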

 

Appreciate your reply. 

 
