Understanding why Table.Buffer makes a difference in dependency chain
I understand from Googling that using Table.Buffer is often a good idea when multiple steps within a query reference the same table. But my experiments suggest it's also a good idea when there are other downstream queries in the query dependency chain.
As per the question I posted here, I have a bunch of queries in a chain that are somewhat recursive in nature: I'm finding matches between two data sources where the dates don't always agree, so I progressively match the tables on larger values of a 'tolerance' factor X using this condition:
Table1.Date = Table2.Date +/- X
And I effectively run this in a loop for values of X from 1 to 7, removing successful matches on each pass so that only the unmatched rows carry forward. This lets me progressively increase the date-mismatch tolerance and make increasingly desperate matches on whatever remains.
My query performs the following steps:
Loads data from Table1 and Table2, and does an inner join on unique ID and Date
Effectively removes these matches from Table1 and Table2 via an anti-join against each referenced table.
Does an inner join on the remaining records, but this time offsets the dates in the Port table by +/- 1 day. This finds some more matches.
Performs steps 2 and 3 over and over, increasing the date offset by 1 each time until I have captured all date mismatches up to a week.
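One pass of the loop above might look something like this in M. This is only a sketch, not my actual code; the function, table, and column names here are assumptions for illustration:

```
// One matching pass at date tolerance x (sketch; names are hypothetical)
MatchPass = (t1 as table, t2 as table, x as number) as record =>
    let
        // Offset Table2's dates by +x days (a second pass with -x covers the other direction)
        Shifted = Table.TransformColumns(t2, {{"Date", each Date.AddDays(_, x), type date}}),
        // Inner join on ID + (offset) Date finds the matches at this tolerance
        Matched = Table.NestedJoin(t1, {"ID", "Date"}, Shifted, {"ID", "Date"}, "M2", JoinKind.Inner),
        // Anti-joins keep only the rows that still have no match, ready for the next pass
        Left1 = Table.NestedJoin(t1, {"ID"}, Matched, {"ID"}, "M", JoinKind.LeftAnti),
        Left2 = Table.NestedJoin(t2, {"ID"}, Matched, {"ID"}, "M", JoinKind.LeftAnti)
    in
        [Matched = Matched,
         Remaining1 = Table.RemoveColumns(Left1, {"M"}),
         Remaining2 = Table.RemoveColumns(Left2, {"M"})]
```

Calling this with x = 1 through 7, feeding each pass's Remaining1/Remaining2 into the next, reproduces the chain described above.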
The query dependency tree that results looks like so:
I've found that using Table.Buffer on both tables whenever I do a join dramatically decreases execution time.
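Concretely, that means wrapping both join inputs in Table.Buffer before the join step, something like this (a sketch; Remaining1 and Remaining2 are hypothetical step names standing in for the unmatched rows from the previous pass):

```
let
    // Materialise both sides in memory so the join doesn't re-evaluate either input
    Buffered1 = Table.Buffer(Remaining1),
    Buffered2 = Table.Buffer(Remaining2),
    Joined = Table.NestedJoin(Buffered1, {"ID", "Date"}, Buffered2, {"ID", "Date"}, "Matches", JoinKind.Inner)
in
    Joined
```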
In this file, with no Table.Buffer, it takes about 17 seconds to run through the chain:
Wow... putting Table.Buffer around any steps that referenced previous queries reduced my load time from potentially hours to one and a half minutes. (I say potentially hours because I'd never dared load all 30k rows of both tables at once, nor attempt as many as 7 recursive steps; even loading a tenth of the data for 5 steps took a good 10 minutes or more.)
I take it that I want to buffer each input just once, as soon as it first appears in the query that references the previous ones?
Say you have a query Q4 that references Q2 and Q3, and Q2 and Q3 both reference Q1, which pulls from a flat file. According to Ehren at Microsoft, because File.Contents results aren't cached, the flat file will be read 5 times: it gets read each time Q1 is directly or indirectly evaluated, which is twice when Q4 runs (once through Q2 and once through Q3), and once each when Q1, Q2, and Q3 themselves load. And I believe he's saying this is the case for both Excel and Power BI.
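If I've understood that correctly, the workaround is to buffer each upstream reference once, at the first step that uses it, so later steps in the same query reuse the in-memory copy instead of re-evaluating the chain. Inside Q4 that might look like this (a sketch, assuming Q2 and Q3 return tables with compatible columns):

```
let
    // Buffer each referenced query once; every later step reuses the buffered copies
    Q2Buffered = Table.Buffer(Q2),
    Q3Buffered = Table.Buffer(Q3),
    Combined = Table.Combine({Q2Buffered, Q3Buffered})
in
    Combined
```

Note that Table.Buffer only lasts for the duration of a single query's evaluation, so buffering inside Q4 doesn't stop Q2 and Q3 from each evaluating Q1 themselves when they load.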
So…when doing any kind of complicated chain, the code produced from the UI just flat out sucks! And that is an epic fail for a tool that is designed to be used by non-experts via the UI.
I think very few people facing performance issues because of this behaviour will manage to form a hypothesis about what's going wrong, find a thread on the web that explains the issue in plain English, and then go wrap the first reference to each upstream query in Table.Buffer. And if you don't diagnose the issue, you just conclude that PQ is a slow dog and switch to using SQL.
Anyone know if there is a UserVoice request that addresses this specific behaviour? And whether this behaviour is now unaddressable, given the architecture choices MS has made?