Data Flows - How to limit data in Power BI Desktop

rdg515 · ‎08-29-2020

I have a large data set that I intend to create as a Data Flow. I am frequently adding fields to the data set, and would like to be able to work with this data set in Power BI without having to download the entire thing. What is best way to do this?

1) Data Flow with Direct Query

- Advantage is no need to download data, but would this impact performance? I know Direct Query isn't as fast as Import, but is it different for a Data Flow that exists in the Service?

2) Setting Range Start and Range End parameters in the Data Flow on the Service

- This doesn't seem to work when I import the Data Flow into Power BI Desktop. It seems I still got the full data set. Is there any point to these parameters for Data Flows?

3) Range Start and Range End parameters in Power BI Desktop

- Since the incremental refresh will be set up on the Data Flow on the Service rather than in Power BI Desktop will this cause permanent filtering on the data source?

Are there any other options?

Sorry if this is a repost; I tried to post the first time without being logged in but it didn't seem to go through.

egorlinus · ‎06-03-2022

Can you try direct query in Dev/Test and import in Production? I think that way you can have best of both options.

lbendlin · ‎08-30-2020

"would like to be able to work with this data set in Power BI without having to download the entire thing"

That statement is a little too fluffy. How will you guarantee that your subset of data is sufficiently representative? This is like the column profiling that is based on the top 1000 rows. You can be very sure that the data on row 1001 will break your assumptions.

RangeStart and RangeEnd can serve your intended purpose if you accept the above risk. But they work a little different. Here's what actually happens.

- In Power Query your data source definition has to include a filter that uses RangeStart and RangeEnd. You can set these parameters as you like, for example to cover the last two months of data.

- In DAX when you configure Incremental refresh you IGNORE the values you set in Power Query (!). Instead you specify to keep the last five years of data but only refresh the last seven days (for example).

That means the values you set in Power Query allow you to only fetch partial data without impacting the actual Incremental Refresh mechanics on the service (sort of - until you push structural changes).

Now - if your data source is a dataflow then I can't really see how incremental refresh would be helpful. You're wasting space by keeping your data both in the dataflow and the dataset. I would go with the direct query option, and if necessary with aggregations, rather than having an import mode connection from a datflows source. Measure the performance and decide what the preferred option should be.

rdg515 · ‎08-31-2020

Are you saying doing the import from the data flow duplicates / wastes space? Won't an import provide improved performance vs a direct query? That is typically the case. Is it different for a Data Flow?

lbendlin · ‎08-31-2020

Yes, that's what I am saying. Using Dataflows as a source in Import mode is not just wasting space, it is also wasting time as you need to first refresh the dataflow and then refresh the dataset.

Mind you, until very recently this was our only option. Direct Query against Dataflows was only made available a few months ago.

rdg515 · ‎08-31-2020

Sorry, I want to make sure I fully undestand because you didn't address my question concerning performance. It's my assumption that direct query is always slower than imports. Is this a correct assumption, even with Direct Query against Data Flows in the Service?

If that's a correct assumption, and I find that the performance is too slow, what are my options?

1) Create summarized queries (same recourse when working with Direct Query in general)

2) Not use Data Flows because it would just be duplicating data

lbendlin · ‎08-31-2020

" It's my assumption that direct query is always slower than imports. "

That's too ambiguous. You need to differentiate between developer performance and end user performance. What is your goal? Do you want to minimize dataset refresh times or do you want to minimize UI response times?

rdg515 · ‎08-31-2020

Minimize UI response time / DAX calculation time is the primary concern because of all the data and complexity of the model.

lbendlin · ‎08-31-2020

Then you need to either go with the wasteful option, or look into a hybrid scenario with aggregations that cover the inital UX (initial visual renders). And of course you will want to examine all your measures with DAX Studio.

Data Flows - How to limit data in Power BI Desktop

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024