Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
jaykilleen
Advocate II
Advocate II

How to Improve Query Reference performance for large tables

I have read the Tips for Creating Reports and it says:

 

"When you reference a query you only load the source data once regardless of the number of queries that reference the initial query. "

 

This don't seem to be the case on my machine. I have one base query named Sales that is a Fact table from SQL Server.

 

Sales (~7Million rows of data)

RegionID

ProductID

CustomerID

RepID

DateID

NetRevenue

Quantity

Margin

 

and then 3 queries that reference Sales and group by combinations of different IDs

 

SalesByRep

SalesByCustomer

SalesByProduct

 

All these queries do is reference Sales, group the results and sum the revenue, margin and quantity fields.

 

When I run refresh each of these queries I am watching PowerBI spin for a good 15Minutes and retrieve around 7M recors from SQL Server.

 

This doesn't appear to align with what the tips are saying. Has anyone experienced this and do you have any further tips to try and improve this performance. I have also set Sales attributes 'Load to Report' and 'Enable Refresh of this Query' to FALSE.

 

 

Is this something I can improve with Table.Buffer or DirectQuery? Any ideas would be greatly appreciated.

 

1 ACCEPTED SOLUTION
pqian
Employee
Employee

@jaykilleen I see you are aware of things like Table.Buffer, so you are pretty advanced PowerQuery user.

 

I can go into a little detail about your question. The answer up front is the tip is somewhat true. The rule to remember here is each query, when been loaded to report, will always be evaluated in isolation. They may even be evaluated by different processes and thus not form a dependent tree structure.

 

For example, if you have a base query A, and then two new queries B and C referencing A. A isn't loaded to report but B and C are. When you hit the Apply button, B and C will simultaneously start loading to report. In both evaluations, A will be evaluated separately, because B isn't aware of C. 

 

Now if we have this model in mind, you can argue the tip isn't true at all. However, PowerQuery evaluations keeps a cache of data seen by evaluations on disk. So if you are within the same cache session and pulled on A multiple times, you will essentially only pay for it the first time (unless of course, you are pulling on different pages of A). This cache will ONLY apply to raw data coming from the data source, any additional transformations will need to be performed on top of it. 

 

Finally, since when you load queries to report, you always want the latest data. So each loading session is essentially a new cache session. Therefore, the tip is somewhat true here. You will only need to pay for the data coming from data source A ONCE per loading all of the queries. (Interestingly for most data sources this is true for duplicates of A as well). However, you will pay for the transformations on top of A N times where N = the number of queries been loaded.

 

With this knowledge in mind, you can see that Table.Buffer doesn't really help you either. What that does is guarantee a stable output from within a query (e.g., between multiple lets). So essentially you declare a point in your data where from there on all transformations will be done on the local copy, vs folded remotely.

 

Finally, yes, DirectQuery will most certainly help with the performance here. WIth DQ you always operate on the remote data source and never keep a local copy of the data. So there isn't a "loading" phase. You will pay for the data as the visuals need them (where it's mostly aggregated).

 

Does this make sense?

 

Regards,

PQ

View solution in original post

23 REPLIES 23
Anonymous
Not applicable

Hi there! I'm experiencing the same issue :(. I have 50 csv that are combined, then split/transform and at some point merged 3 times.

Hi!

 

I know, it is an old topic, but I think I found a real good explanation about referencing, caching and buffering when the source is static (so Excel or CSV, not SQL server).

 

https://social.technet.microsoft.com/Forums/en-US/8d5ee632-fdff-4ba2-b150-bb3591f955fb/queries-evalu...

 

The hint to this article is coming from this post on the forum: 

https://community.powerbi.com/t5/Desktop/Using-quot-Reference-quot-in-Query-Editor-does-not-prevent/...

 

In my current case, i have a PBIX files based on 5 Exel files, each 3 MB. They are based on Essbase, so quite structured.

I want to create a dynamic date table based on the pulled fact data. We are talking about a maximum of 36 months.

So, there is the query A, ETL for the fact, and building a column Date. There is then a second query, "Date", referencing to A. It seems simple, but the refresh of "Date" is taking more than 5 minutes!

Therefore, i am trying to optimise this. Any other idea is welcome.

 

Cheers. G

Dimitar
Frequent Visitor

Hi, thanks for this valuable exchange.

I wonder if we are supposed to index the fields of the source table on which we are filtering in Power Query ? Is this goint to accelerate the retrieval of the 'wanted' data.

Thanks.

healthEteam
Resolver I
Resolver I

Just a question as I am learning myself.

 

In these examples where there are multiple queries that are basically refiltered versions of the original query, is it not more efficient to bring the only the main query in once and then create the subqueries within the pbx model by creating separate calculated tables, filtering the main table. 

 

I have a situation similar, bring in one query of a fact table, then have created 2 other calculated tables that are similar to your filtered queries above.

 

This way there is only one query that is run vs. 3.

It probably depends how big the share of the data is that you need at the end. If you basically need all the data, just in separate tables, then the caching and referencing will probably be the fastest.

 

But if you need just need 2 different years for example from 10 years data (in 2 different queries/tables), then I'd expect 2 parallel queries who are folding back to the server to be faster.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

pqian
Employee
Employee

@jaykilleen I see you are aware of things like Table.Buffer, so you are pretty advanced PowerQuery user.

 

I can go into a little detail about your question. The answer up front is the tip is somewhat true. The rule to remember here is each query, when been loaded to report, will always be evaluated in isolation. They may even be evaluated by different processes and thus not form a dependent tree structure.

 

For example, if you have a base query A, and then two new queries B and C referencing A. A isn't loaded to report but B and C are. When you hit the Apply button, B and C will simultaneously start loading to report. In both evaluations, A will be evaluated separately, because B isn't aware of C. 

 

Now if we have this model in mind, you can argue the tip isn't true at all. However, PowerQuery evaluations keeps a cache of data seen by evaluations on disk. So if you are within the same cache session and pulled on A multiple times, you will essentially only pay for it the first time (unless of course, you are pulling on different pages of A). This cache will ONLY apply to raw data coming from the data source, any additional transformations will need to be performed on top of it. 

 

Finally, since when you load queries to report, you always want the latest data. So each loading session is essentially a new cache session. Therefore, the tip is somewhat true here. You will only need to pay for the data coming from data source A ONCE per loading all of the queries. (Interestingly for most data sources this is true for duplicates of A as well). However, you will pay for the transformations on top of A N times where N = the number of queries been loaded.

 

With this knowledge in mind, you can see that Table.Buffer doesn't really help you either. What that does is guarantee a stable output from within a query (e.g., between multiple lets). So essentially you declare a point in your data where from there on all transformations will be done on the local copy, vs folded remotely.

 

Finally, yes, DirectQuery will most certainly help with the performance here. WIth DQ you always operate on the remote data source and never keep a local copy of the data. So there isn't a "loading" phase. You will pay for the data as the visuals need them (where it's mostly aggregated).

 

Does this make sense?

 

Regards,

PQ

It makes total sense. Smiley Very Happy

Jaderson Almeida
Business Coordinator

@pqian Thanks for all this detail. I think I will experiment more with DirectQuery and also just the way I build up my queries.

I think I am in a position where I should be utilising DAX on the Report View instead of creating some many queries for basic things like this.

 

This should also help keep my cache low (it is currently not difficult to get close to the 4GB limit which I know I can increase but I also run VMs on my machine and they fall halt when there is not enough spare memory).

@pqian thank you so much for this explanation - very much appreciated. Especially as there doesn't seem to me so much about this available at the moment - or could you give more guidance on where to find infos like this?

 

Is my understanding correct that there wouldn't be a cache to share if B & C would fold back to the server?:

 

Like 1st query A is only adressing the whole table in the server and query B is filtering P&L - accounts for example. B would fold back to the server, so only the selected accounts in B will be returned from the server.

 

If query C now selects different accounts from A (Balance Sheet), only those would be returned.

 

Hence: Your example of A providing a cache to share for B and C would only work if the query folding would be broken before B & C are executed.

 

So in cases where direct query isn't an option or becomes too slow for the user (when there are interactions on the reports that would trigger expensive calculations during load), the Table.Buffer could actually make sense: Namely in those cases where the alternatives of multiple folding (sub-)queries are slower than using Table.Buffer as a folder-stop at the end of query A?

Admittedly, this might not be the case very often.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

@ImkeF Your observation is correct. Essentially folding the query differently to the server could potentially break the cache sharing. Whether Table.Buffer make sense in this case is still unclear to me. In the example you gave, if filtering B and C reduces the datasize dramatically, then it may still be better if you push both queries to the server and load the data separately. 

 

We are touching onto implementation details of PowerQuery/PowerBI desktop. It is not a public contract and should only serve as guidelines on how optimization should be carried out. So they aren't documented anywhere, although some experimentation should quickly reveal what's going on behind the scenes.

Dimitar
Frequent Visitor

Hi, thanks for the useful tips.

I wander if we have to index all fields in our source table, on which we are making filters in Power Query?

Is this going to accelerate the selection of the 'right' data for using in PBI?

Thanks

Not sure what exactly you mean with "Indexing".

If you mean Table.AddIndexColumn, this will slow your queries down.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Dimitar
Frequent Visitor

Hi ImkeF,

I mean creating an index in the MySQL database on the fields on which I will filter later in PowerQuery. I expect an index on such fields should help to select the rows which I need to keep in my model.

Thanks

Yes, that's correct (provided MySQL-queries fold, which I don't know).

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Hello @ImkeF. I've pulled in to Power BI a population of accounts (single column) using a native sql query. I then want to use those accounts as a filter for other queries that will hit different tables in the database. Would you expect those merges (inner joins) to occur faster if I Table.Buffer the native SQL query?

 

(I won't be doing any kind of additional filtering on the other tables meaning I generally expect to get the same population of accounts from every table).

I was surprised by the performance so often, that I would always check both options 🙂

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Hi!
I got a question about performance, I am using as source Excel files.
The are currently located on our local server. 
Did you notice, do you know it it is better to have them stored on some MS cloud, like OneDrive, SharePoint or so, or if it at the end does not play a role?

Despite some googling, i could not find an answer to that.

Side note: we are using O365 but the Excel files are stored on our local server.

Cheers

G.

It is very unlikely that moving to web storage will improve performance. But if you have very slow on prem performance it might be worth a try.

What would actually improve performance for most cases is to use csv sources instead of xls(x). So if there is a chance to move to csv instead, you will probably see speed improvement.

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Hi,
A 8gb ram and i5 processor could slow things as well? What sort of PC specifications should help me on it? Maybe, SSD, 16gb, i7 and 256ram video?

ImkeF
Super User
Super User

Have you actually tested that the refresh with one referencing query is substantially faster than with 3 (say: 5-7 minutes)?

 

 Do you load your fact table to the model or just the referencing queries?

 

Do your queries contain custom SQL-statements? They would break query folding: https://www.mssqltips.com/sqlservertip/3635/query-folding-in-power-query-to-improve-performance/

 

 

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.