sevenhills
Super User

DataFlow and Dataset size

Hi 

 

I started creating a bunch of dataflows, then built datasets on top of them, and later paginated reports. It was working smoothly. Now we have added more years of data and each dataset has grown to 400 MB+.

 

How do we find out a dataflow's size? (From searching this forum, it appears that is not possible.)

 

How do we modify the code and control the dataset size during development? It is becoming hard to debug or to add more tables. Any other suggestions for faster development?

 

TIA

10 REPLIES
lbendlin
Super User

400 MB seems small.  What is your constraint?

 

To control the development dataset size you can use a fake incremental refresh: use RangeStart and RangeEnd parameters to limit the data volume in dev, but in Power BI specify that the storage range is the same as the refresh range.
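A minimal Power Query M sketch of that pattern, assuming a SQL source with a datetime column called OrderDate (the server, table and column names here are placeholders):

let
    // RangeStart and RangeEnd must exist as DateTime parameters in the .pbix.
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // In dev, keep RangeStart/RangeEnd to a narrow window (e.g. one month)
    // so the local model stays small; widen the window for the published copy.
    Filtered = Table.SelectRows(FactSales, each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    Filtered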

 

Or use Deployment Pipelines to point at different (sub) data sources.
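One possible way to wire that up is to drive the connection through query parameters and let deployment pipeline parameter rules override them per stage (all names below are invented):

let
    // ServerName and DatabaseName are text parameters; deployment pipeline
    // parameter rules can assign different values in Dev, UAT and Prod,
    // e.g. pointing Dev at a smaller copy of the data.
    Source = Sql.Database(ServerName, DatabaseName),
    DimCustomer = Source{[Schema = "dbo", Item = "DimCustomer"]}[Data]
in
    DimCustomer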

There are two parts to my question:

 

a) I would like to know whether there is any way to see a dataflow's size, and the size of each table/query inside the dataflow. I use deployment pipelines between environments, so I would like to compare the UAT and Prod environments by size as well as by code (.json file). FYI, UAT and Prod both point to the same data sources.

 

b) Let me clarify further: size is not an issue in the Power BI service. I am aware that it can go up to 10 GB.

 

My issue is more about editing the dataset (.pbix file); it is taking too much time, and working remotely over VPN makes it even slower. To make changes I am downloading, editing and then publishing.

Thanks for replying.

a) You can get that information from the dataflow refresh history. Click on that little download arrow and then look inside the downloaded CSV file to see the partition sizes.

 

(screenshot: dataflow refresh history with the CSV download arrow)
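If you want to total those figures across many dataflows, a rough Power Query sketch like the one below can read the exported CSV (the file path is made up, and the column names are assumed to match the export shown further down this thread):

let
    // Load the refresh-history CSV downloaded from the dataflow.
    Source = Csv.Document(File.Contents("C:\Temp\DataflowRefreshHistory.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    Typed = Table.TransformColumnTypes(Promoted, {{"Bytes processed (KB)", Int64.Type}, {"Max commit (KB)", Int64.Type}}),
    // Sum the per-partition figures to get a rough total for the dataflow.
    TotalBytesProcessedKB = List.Sum(Typed[#"Bytes processed (KB)"])
in
    TotalBytesProcessedKB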

 

b) My proposal to use the RangeStart and RangeEnd parameters seems to fit your requirement. As you likely know, any structural update will re-trigger a full refresh anyway.

 

a) I looked at it before posting, but it does not give the storage size, unless Max commit is the storage size.

 

For example, here are the relevant columns:

Rows processed | Bytes processed (KB) | Max commit (KB)
4391838        | 1639203              | 156016
1806           | 558                  | 60
4390648        | 1676142              | 122948
4379521        | 1676142              | 157216

 

b) Thanks for elaborating. I am aware of it; let me rethink this approach.

 

Meanwhile, from searching online, it seems they are considering providing a DirectQuery mode for Power BI dataflows consumed in datasets; until then, there are no other options.

 

Thanks

Remember the values are shown in KB. Multiply Bytes processed by 1024 to get your partition sizes in bytes.
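For example, taking the first partition in the table above: 1,639,203 KB × 1,024 ≈ 1,678,543,872 bytes, i.e. roughly 1.68 GB uncompressed.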

 

Direct Query against Dataflows is a travesty. All it does is put an Azure SQL database between your blob storage and your dataset.

a) So you are saying Max commit (KB) is the storage size of a partition, and adding up the values from each row will give the whole dataflow size.

 

I know this is not 100% accurate, but if it is true it will help me work out storage sizes for all dataflows, as we have many dataflows.

 

The idea to view dataflow sizes has already been submitted: https://ideas.powerbi.com/ideas/idea/?ideaid=5b955cf8-38ac-4864-ac3a-4993cad1b2d6

 

Thank you

 

b) I reached out to my team; currently they want all data in all environments, as the project is young. 😞 I will look for other options.

 

Thank you

"So you are saying Max commit (KB) is the storage of partition size."

 

No. Bytes Processed is the uncompressed dataflow partition size.

 

" it helps for me to do storage allocation for all reports as we have many reports."

 

This has nothing to do with reports.

I am confused, as I started the thread to get storage sizes.

 

Sorry, I meant dataflows, and I have updated my previous reply.

" it helps for me to do storage sizes for all dataflows as we have many dataflows."

My opinion is that the storage size of the dataflow partition is represented by the "Bytes processed" column, not by the "Max commit"  column.

I doubt it, because we can process a lot of bytes while only the needed data is stored, and in compressed format.

 

This article goes into detail: https://docs.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-understand-optimize-re... ("... To cache the entity, Power BI writes it to storage and to SQL.")

 

There is no easy way to determine it.

For now, let me use "sum of Bytes processed across all partitions × 90%" for size estimates, since we don't have a straightforward approach yet.
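As a rough worked example with the figures above, the first partition would come out at 1,639,203 KB × 0.9 ≈ 1,475,283 KB (about 1.5 GB); summing that over every partition gives the estimate for the whole dataflow.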

 

Appreciate your reply. 

 
