Eckstr
Regular Visitor

Adding Large Dataflow Entities To Dataset

TL;DR:

A large dataflow entity times out when I try to add it to the data model.

 

Goal:

I have a large dataflow entity (approximately 2.4 billion rows) that I want to connect to in Power BI Desktop.

 

Problem:

Whenever I try to add the dataflow entity to the data model, it tries to download the whole entity. Due to the size of the data in question, the connection times out after approximately two hours. How do I connect to large dataflows in PBI Desktop (or even Service)?

 

Notes & Additional information:

  • This is in Premium capacity.
  • All users of this dataflow will have Pro licences.
  • Filtering the data in Power Query has no effect on the size of the data downloaded. For example, if I filter to just a single row, it will still try to load the full dataflow entity first (see the sketch after this list).
  • Reloading the dataflow itself with a smaller subset is out of the question. Multiple team members will be using this dataflow, so having everyone change the dataflow itself every time they want to build a report is not practical.
  • I tried changing the timeout options within Power Query, but it hasn't helped.
  • This is the first (and smallest) of the fact tables we're looking to move from datasets to dataflows, so any solution will have to scale to those other tables as well.
  • Building the transformations from the dataflow into a view on the source system is not an option.
  • Building a shared dataset that contains all the required information is not possible, because different team members require different subsets of the data and different dimensions. When all these additional dimensions are taken into account, the dataset balloons to over 60 GB, well over the 10 GB limit of the Power BI service.
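For reference, a minimal sketch of the kind of filter that does not help, using the standard dataflows connector in Power Query. All IDs and names are placeholders, and the exact navigation keys can vary by connector version. The point is that the Table.SelectRows step runs locally, after the download, because this connector does not fold filters back to the dataflow in import mode:

    let
        // Placeholder workspace/dataflow GUIDs and entity name
        Source = PowerBI.Dataflows(null),
        Workspace = Source{[workspaceId = "00000000-0000-0000-0000-000000000001"]}[Data],
        Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000002"]}[Data],
        Entity = Dataflow{[entity = "FactSales"]}[Data],
        // This filter is applied only after the full entity has been read
        Filtered = Table.SelectRows(Entity, each [OrderId] = 12345)
    in
        Filtered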
1 ACCEPTED SOLUTION

Yes, deployment pipelines allow you to parameterize the data sources. It is preferable to have data sources that support query folding.

 

Another option for you would be to evaluate mixed storage mode with aggregations.  


10 REPLIES
lbendlin
Super User

Two notes:

1. Have you tried to access the dataflow via DirectQuery? This is now possible, but I don't know if it helps in your scenario.

2. Any chance of normalizing the source data / splitting it into separate dataflows?

 

I believe the 10 GB limit is either already lifted or about to be lifted. The idea is that a dataset will use up the available memory of a SKU and then swap data in and out. That allows working with datasets that are larger than half the memory size of the SKU (people tend to forget that each dataset always occupies twice the RAM).
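(As a rough worked example, using the published Premium SKU figures: a P1 has 25 GB of RAM, so a 10 GB dataset that doubles in memory already consumes ~20 GB, which is why the effective ceiling has been half the SKU's memory; swapping is what lifts that ceiling.)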

 

@lbendlin thanks for the reply.

 

I've tried the DirectQuery preview feature. Unfortunately, it keeps timing out at the visualization level.

 

The 2.4 billion rows is the normalised data. I've split out the historical data into separate tables, so this is only the most recent month's data.

 

The news about the dataset restrictions being lifted is great to hear! If I cannot get the dataflow to work, I will look at this option. The big downside of using a dataset is that the moment there's a new fact table or dimension, we have to add it in Desktop and then re-upload to the Service. If this happens well after the initial dataset upload, all the fact tables will need to be manually updated. (We split the fact tables into history, which remains static, and current, which can be refreshed incrementally; this lets us load all the data without the server timing out.)

One more constraint that I forgot to mention: due to the size of the different fact tables, we expect to run into issues incrementally refreshing a dataset that contains all of them, as they would all be refreshed at the same time.

 

With dataflows, we're able to stagger each refresh throughout the day.

Incremental refresh is done by data source, not by dataset.

@lbendlin 5 of our 6 fact tables have the same data source.

OK, when I say "data source" I mean "a Power Query query resulting in a Power BI table". Surely you have separate queries for them, right?

 

Incremental refresh can be configured independently for each of these Power BI tables. Granted, they all have to share the RangeStart and RangeEnd parameters, so this may be a moot point.
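For illustration, a minimal sketch of one such query, assuming a hypothetical SQL source and a datetime column named OrderDate. RangeStart and RangeEnd are the reserved parameter names Power BI requires for incremental refresh; each fact-table query applies its own filter against the same two parameters:

    let
        // Placeholder server, database, and table names
        Source = Sql.Database("sqlserver01", "warehouse"),
        Fact = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
        // The shared RangeStart/RangeEnd parameters drive the partitions
        Filtered = Table.SelectRows(
            Fact,
            each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
    in
        Filtered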

@lbendlin thanks for your assistance with this topic. Really appreciated.

 

Based on our discussion so far, I take it that there's no way to get data from an extremely large dataflow entity directly into PBI Desktop?

 

I also just came across the deployment pipelines feature. I haven't used this feature before. Can I change the dataflow that a dataset points to in the different stages (Dev, Test, and Prod)? If so, I can build a dataflow that uses a small subset of the original full dataflow, use that for Dev/Test, and then just change the data source to the full dataflow in Prod.

 

If all else fails, I'll take the dataset approach.

Yes, deployment pipelines allow you to parameterize the data sources. It is preferable to have data sources that support query folding.
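As a sketch only, the parameterized source might look like the following. WorkspaceId and DataflowId are hypothetical text parameters (not anything from this thread); a deployment pipeline rule then overrides their values per stage, so Dev/Test point at the subset dataflow and Prod at the full one:

    let
        Source = PowerBI.Dataflows(null),
        // WorkspaceId and DataflowId are text parameters that deployment
        // pipeline rules override in each stage
        Workspace = Source{[workspaceId = WorkspaceId]}[Data],
        Dataflow = Workspace{[dataflowId = DataflowId]}[Data],
        Entity = Dataflow{[entity = "FactSales"]}[Data]
    in
        Entity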

 

Another option for you would be to evaluate mixed storage mode with aggregations.  
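To make the aggregations option concrete, a hedged sketch: keep the detail entity in DirectQuery, build a small imported aggregation table from it, and map the two with Manage aggregations in Power BI Desktop. Column names and the navigation are placeholders:

    let
        // Same placeholder navigation as above; Detail stays in DirectQuery
        Source = PowerBI.Dataflows(null),
        Workspace = Source{[workspaceId = WorkspaceId]}[Data],
        Dataflow = Workspace{[dataflowId = DataflowId]}[Data],
        Detail = Dataflow{[entity = "FactSales"]}[Data],
        // Imported aggregation table: one row per date/product combination
        // instead of billions of detail rows
        Agg = Table.Group(
            Detail,
            {"DateKey", "ProductKey"},
            {{"SalesAmount", each List.Sum([SalesAmount]), type number},
             {"RowCount", each Table.RowCount(_), Int64.Type}})
    in
        Agg

Visuals that can be answered from the imported aggregation table never touch the 2.4-billion-row entity; only drill-downs fall back to DirectQuery.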

I'll try the deployment pipeline to solve this problem. Thanks again for your assistance.

You may want to look at ALM Toolkit, which will allow you to make structural changes (within reason) without requiring a full data reload. (Disclaimer: this requires XMLA read/write to be enabled on your SKU, and it may be incompatible with other preview options.)
