JotaL
Frequent Visitor

Faster Refresh of Data - Questions about behaviour and formulas to use

Hello,

 

I am not yet an advanced Power Query user, but I have been using it a lot for the last couple of years, so thank you in advance for any help. And sorry in advance for the long text ahead.

 

I have a client that uses two different software tools: one to make project estimates (Tool A) and another to control costs during project execution (Tool B). To send information from Tool A to Tool B, they extract an Excel report from A (an existing feature of that tool), which must be transformed so that we can use an Excel import feature in Tool B. We are talking about thousands of rows.

 

We used Power Query for that transformation, plus some VBA code that lets them select the report from A and trigger the data refresh.

That goes very smoothly.

 

However, during project execution there are budget changes, which are also produced in and exported from A, and imported into B. This is a bit more complex due to the limitations of Tool B, but also because the client requires us to identify all changes, of any nature, between the original import and the current one.

These changes include new items and changed items. Changed items cover items that were deleted in the second report or that migrated position, changes in quantities and cost rates, but also specific changes in description, unit of measure and coding. The final output includes two import tables: one for all new items and another for all items that changed in some way between the two versions, as they need to be imported separately (a requirement of the cost control tool).

 

I created a similar tool, this time picking up two files, which become the sources in Power Query. I transform those and generate a couple of queries to limit the number of columns used for comparison. From those, I build separate queries for each type of change, and finally I produce the output table used for importing the data.

 

This new PQ tool, however, ran much slower, especially for the end users, so I started to optimize it.

On my own machine, with the first version of my queries, a project with 5,000 rows took 5 minutes to produce both outputs (already with Fast Data Load selected). Now, with my latest query transformations, it takes less than half that. So I guess I am on the right path.

 

In any case, this was a nice exercise for me to understand my limitations, and I have some questions that could help me improve this process even more:

 

1. I assume one of the main causes of slowness is the number of lookups that have to be made between the old version of the source and the new version.

In many cases we have to retrieve a new (or old) value to compare for the same cost item, so we must search thousands of rows for each row. What is the fastest solution for this: Table.SelectRows or Table.Join, having the tables properly sorted?

 

2. And what about cases where we just want to verify whether the cost item ID exists in the other table? Would you use List.Contains?

 

3. What is the use of List.Buffer / Table.Buffer in these situations?

Is this feature even useful when using the previously mentioned formulas? And if so, when should it be used? Do we buffer the list or table in the previous step, or directly in the step code (e.g. let data = Table.Buffer(#"Table") in ...)?

 

4. Many times I have noticed that a query calling another one is stuck at the source, even though the other query has already loaded entirely. Most likely I do not understand very well how Power Query works, but my assumptions about the way it flows were the following:

- If Query 2 calls Query 1, then Query 2 waits for Query 1 to finish; they are evaluated in series. If both queries call a source (the same one or not), but never each other, including throughout the remaining transformations, they will work in parallel.

- If there are multiple queries (1, 2, 3) sharing a common source and initial transformation steps, then it is better to make that common start a single query (0), and then write the remainder of each of the other queries calling this Query 0, without repeating that start.

- It is useless to buffer a query at its final step, even if we use that same query multiple times in subsequent queries. My understanding is that for .Buffer formulas, the buffer is only kept during the evaluation of the query where it is included. Am I correct?

I am asking this because many times I run into the stuck-at-source behaviour described above.

 

5. For a repetitive set of steps, does the use of a function speed up the queries in any way? Or is it just a nice way to organise and clean up the code a bit, while possibly slowing things down?

 

Thank you again for any feedback.

1 ACCEPTED SOLUTION
BA_Pete
Super User

Hi @JotaL,

 

Pre-note: Without seeing your exact data, query structure, query code, etc., it's almost impossible to advise the best things to do for your scenario. However, I'll try and impart some best practice based on what you've provided.

 

-1-

This broadly depends on where each of your comparable sources are in relation to one another and whether they have relatable unique fields. If you can get your unique keys sorted ascending BEFORE bringing in to Power Query, then a Table.Join with the JoinAlgorithm.SortMerge argument would probably be pretty fast, otherwise Table.SelectRows with a buffered table could be fast as well.
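For illustration, a minimal sketch of that sort-merge join, assuming two hypothetical queries OldItems and NewItems that each have a unique [ItemID] column and arrive already sorted ascending on it:

let
    // Prefix the right-hand columns to avoid name collisions with OldItems
    NewPrefixed = Table.PrefixColumns(NewItems, "New"),
    // SortMerge streams both sorted inputs once instead of building a hash
    // table - but it returns wrong results if either input is NOT sorted
    Joined = Table.Join(
        OldItems, "ItemID",
        NewPrefixed, "New.ItemID",
        JoinKind.LeftOuter,
        JoinAlgorithm.SortMerge
    )
in
    Joined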

 

-2-

Again, kind of depends on your scenario but, yes, List.Contains is usually a decent function to use for comparing bulk values.
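A hedged sketch of that ID-existence check (the query and column names here are invented):

let
    // Buffer the old IDs once so the list isn't re-scanned for every row
    oldIDs = List.Buffer(OldItems[ItemID]),
    // Keep only rows whose ID doesn't exist in the old version, i.e. new items
    NewOnly = Table.SelectRows(NewItems, each not List.Contains(oldIDs, [ItemID]))
in
    NewOnly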

 

-3-

Ok, so buffering a table or a list (or a binary) shoves the whole thing into memory at runtime and dumps it out at the end of evaluation. It allows you to fix an object in RAM to avoid it having to be reloaded by PQ each time it needs to scan it. A good example might be doing a conditional merge in PQ, where you would buffer the right-hand table so PQ doesn't need to reload it for every row match scan.

The basic syntax for buffering would be something like this:

let
    Source = yourSource,
    bufferListFromOtherQuery = List.Buffer(otherQuery[Column]),
    // List.Contains needs a row context, so it sits inside an 'each'
    xForm = Table.AddColumn(Source, "Matched",
        each List.Contains(bufferListFromOtherQuery, [ColumnToCompare])),
    ...
in
    ...

// You can also do it more inline, something like this
// (buffering in its own step, as above, is generally the safer pattern):
let
    Source = yourSource,
    xForm = Table.AddColumn(Source, "Matched",
        each List.Contains(List.Buffer(otherQuery[Column]), [ColumnToCompare])),
    ...
in
    ...
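And, as a hedged sketch of the conditional-merge case mentioned above (table and column names are invented), with the right-hand table buffered once:

let
    // Buffer the right-hand table so it isn't reloaded for every left row
    rightBuffered = Table.Buffer(RightTable),
    // For each left row, collect right rows matching a non-equality condition
    Merged = Table.AddColumn(LeftTable, "Matches", (leftRow) =>
        Table.SelectRows(rightBuffered, (r) =>
            r[ItemID] = leftRow[ItemID] and r[EffectiveDate] <= leftRow[OrderDate]))
in
    Merged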

 

I don't know whether this can/would help your scenario without understanding all the variables, but it's worth noting that using these incorrectly can actually damage performance.

 

-4-

i) Yes, that's pretty much how it works.

ii) There's no real performance benefit in making a common start a single query; it's really only a readability and maintenance improvement (see the sketch after this list). There are some edge cases where PQ will cache queries and reuse them for subsequent operations but, broadly, you should expect a source hit per query regardless of whether they're duplicated or referenced.

iii) Yes, it's pointless buffering a query at its final step. The buffer will get dumped once the query evaluation ends and will not be passed on to a subsequent operation, similar to point ii).
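To illustrate ii) (query names are hypothetical): Query0 would hold the shared source and initial transformation steps, and each downstream query references it, but each reference still triggers its own evaluation of Query0:

// Query0 holds the common source + initial transformation steps.
// Query1 then starts from it:
let
    Source = Query0,    // expect this to re-evaluate Query0, including its source hit
    NewItems = Table.SelectRows(Source, each [ChangeType] = "New")
in
    NewItems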

 

-5-

Functions just package the steps into an easier-to-use reference. The steps still all get performed; you just don't have to type them all out each time. As such, there's no performance difference either way.
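For example, a minimal sketch of wrapping a repeated clean-up into a function (all names here are invented):

// A function query, e.g. saved as fnCleanReport:
(raw as table) as table =>
    let
        Promoted = Table.PromoteHeaders(raw, [PromoteAllScalars = true]),
        Typed = Table.TransformColumnTypes(Promoted, {{"Quantity", type number}})
    in
        Typed

// Calling fnCleanReport(SourceA) and fnCleanReport(SourceB) performs exactly
// the same steps as typing them out twice - no faster, no slower.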




3 REPLIES
JotaL
Frequent Visitor

Thank you @BA_Pete,

Just one follow-up question, regarding 4.ii. So that common predecessor query will basically work like a function being called at the beginning, which will in any case be evaluated for each individual query. I will have all those queries working in parallel and not as a fork.

 

And what if, instead of splitting the queries, I create that common query and buffer it, turn each of the remaining queries into a function, and run those functions inside a Table.Combine in the step after the buffer?

Would this create that fork and improve performance?

 

BA_Pete
Super User

Hm, it's a good question, and I think it goes beyond my understanding of how PQ really works in the background.

I'm fairly sure that the Table.Buffer won't do what you think it will in this scenario. Table/List/Binary.Buffer isn't a cache; it's an operation-specific copy of an object that can be scanned, but it isn't persisted as such after the specific use-case.
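To illustrate that scoping (query names invented): a buffer created while one query evaluates is not reused when another query references it; the buffering step simply runs again:

// QueryA:
let
    Buffered = Table.Buffer(SourceTable)    // held in RAM only while QueryA evaluates
in
    Buffered

// QueryB:
let
    Source = QueryA,    // QueryA re-evaluates here and its buffer is rebuilt
    Kept = Table.SelectRows(Source, each [Quantity] > 0)
in
    Kept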

In terms of whether referencing the start of the query multiple times within the query itself avoids the additional source hits, I just don't know. If you wanted the absolute mind-bending facts about it, you could ask a wizard like Ehren to explain it but, to be honest, I'd personally just test it and see 🙂

 

Pete


