PowerQuery: Merge Queries HORRIBLE performance

Anonymous · ‎12-12-2018

Hi everyone,

we're really (really really) struggling with Power Query and pulling data in to Power BI. Our data (which sits in SQL server) is:

DepartmentDim

ProviderDim

PatientDim

VisitFact (which links to the above 3 dimensions)

Here's what I want to do: first, select a small group of departments I'm interested in by filtering the DepartmentDim. Then pull in all rows in VisitFact related to those departments. Then load ProviderDim and PatientDim with only the rows which actually match the values which came in to VisitFact.

So basically the steps I go through:

1. Filter down DepartmentDim where department name contains some text I'm interested in

2. "Merge as new" the VisitFact with DepartmentDim, using an inner join on the department key column

At this point, even though only a very small number of rows in VisitFact match the values from the department data (less than 30 thousand), Power Query DOESN'T fold this - I can see it running through all 40 million rows on our VisitFact to generate the preview. This takes forever - 4 or 5 minutes.

3. Then I "expand" the DepartmentDim column that shows up on the merged table, and choose to add the "department key" column from the department dim. I then "remove" that column because I don't really want it - I just needed the inner join to filter down the rows

At this point, I'd like to then take the merged table and use it to "merge as new" with the patient and provider tables, doing inner joins on both, so that the resulting patient and provider tables are filtered down to only valid rows that matched in the VisitFact. But at this point the painfully slow performance is killing me.

Now the interesting pieces:

A. If I right mouse click on the SOURCE step for the merged query - "view native query" is grayed out - meaning this isn't getting pushed down. This is consistent with what I observer with it reading in all 40 million rows and taking forever

B. If I right mouse click on the "expand column" step below that, the "view native query" IS active - and I can see (correct) SQL. This step also displays data instantly.

The reason this is bugging me is because we'd like to do this on fact tables that have BILLIONS of rows - but our machines catch on fire reading in that many rows...and my hair grows whiter waiting for it to finish. Give up after 30+ minutes. It just doesn't make sense to me why Power Query is doing this...it's an inner join between two tables, so why is it manually processing that step instead of something like "select * from table_a inner join table_b on key column".....should only take a second to get the return set.

Is there another way to do this? Some way to "trick" the front end of Power Query so this step folds?

p.s. I do NOT want to write SQL as that means joining it up with other stuff later on won't fold, and our end users aren't always SQL savy. And I really don't want to learn M code - because it seems hard, and end users definitely shouldn't have to learn that. This is a simple inner join on two tables - should be able to do it (performantly) directly in the GUI?

Thanks!

Scott

Wayfarer · ‎03-17-2020

Thanks for the pointers @Anonymous. I'll keep your tips in mind; there's also further learning and investigation to do on my part.

@edhans Yep, I was thinking that. This was just an add on of "I have performance issues too!".

Wayfarer · ‎03-17-2020

I've been trying to use merge a bit too, and basically come out with a "I give up". I don't think sitting waiting for half an hour for my data to refresh, or apply changes, is a good use of my time at work.

Maybe I don't understand what it's for? I'm dealing with millions of rows in 10-20 tables, just for this one report. My background makes me work in a way of "filter first" and to link the individual queries (i.e tables) in Power Query together - not to get literally everything and filter afterwards, and rely on Power BI relationships to do so as well.

Is that not how one should approach working in Power BI? Is an inner merge query for something else, not to be used the same as a t-sql inner join?

edhans · ‎03-17-2020

@Wayfarer consider starting a new thread with your exact circumstances. FWIW, I've used Power BI to merge very large tables from SQL Server and all of it folds so the server does all of the work.

Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling

Proud to be a Super User!

MCSA: BI Reporting

Anonymous · ‎03-18-2020

Hi @edhans , the queries definitely fold when you RUN the load - but they are not being utilized in the user interface, that's all done by the power query engine itself (for some reason). So in cases of large tables, it's very possible to run an entire load more quickly than you can get the "Merge" operation user interface to show up.

Hope this clarifies!

Scott

Anonymous · ‎03-17-2020

Hi @Wayfarer , I've been doing a lot of work using merges against dimensional models lately, and we've come up with some info and tricks that may help.

First, some merge basics that we've puzzled out:

1. When doing a merge, the GUI does not submit any SQL to the database for results that we can see.

2. If you use a "left outer" type merge, the merge GUI will just take the preview data for Table A and slap a "magic column" on it (the column that can be expanded) to represent Table B. This is fast and performant...as it just "reuses" the preview data and doesn't try to compute anything

3. If you use an "inner" type merge - the GUI can no longer figure out what rows to display, because rows from Table A can be dropped if there was no matching row in Table B

3a. But because the GUI doesn't submit SQL to the databsae, this is done in the Power Query back end

3b. Meaning it pulls all rows from Table A, then pulls all rows from Table B, and then manually joins them in the Power Query engine. You can see this - if you look in the bottom right corner of Power Query, you can see the huge number of rows being pulled down

3c. This is HORRIBLY SLOW when you use tables that have any volume...anything with over a million rows starts to take forever

4. You can "trick" Power Query into changing a left join merge into an inner merge by doing the following:

4a. Use a left outer join in the merge GUI, for example TableA.col1 join to TableB.col1

4b. Then expand TableB in the next step

4c. Last and not least, put a Remove Empty filter on the TableB.col1 column. When Power Query sees this, it converts the left outer join into an inner join automatically (not in the GUI - but in the SQL that gets submitted to the database)

4d. This is a good compromise, it gets the fast GUI response of a "left merge" but still submits inner joins to the underlying database

Unfortunately I don't have experience with the other join types...left and inner are the only ones I've tried.

Hope this helps!

Scott

ebailly1 · ‎11-07-2019

Is there any documentation anywhere that explains why the Merge Queries function is absolutely horrible and if there is a solution in the works? Anyone looking to utilize Power Query cannot use this function because it brings development to a standstill having to wait for things to process.

Anonymous · ‎11-07-2019

Hi @ebailly1 , I have not seen any documentation at all. I have brought this up multiple times with the development team, for instance at MBAS this summer and through a premium support ticket. The issue seems to be:

- If using "left join" type logic - the UI essentially doesn't have to compute anything, it just slaps a dummy column on the end. This displays quickly in the UI because it doesn't really need to do anything

- If using "inner" type logic - the UI should (but doesn't) send a query to the database. Instead, for whatever reason, it pulls in both tables and tries to manually do the join in the Power Query engine instead of pushing it down to the database to do it

We've been working around this in two different ways, either manually adding Table.AddJoinColumn steps, or by always doing left joins. If we do left joins, we then sometimes do a "not null" type test to force the results to be the same as an inner join. My colleague has done some testing, and it seems like it at least some of the cases this actually DOES result in an inner join being sent to the database. We haven't exhaustively tested this though, and not sure if this happens all the time, if steps have to be in a certain order to make it work, etc.

One other note is that once you click on the "Expand" for the column after the merge step - then the UI does start sending the queries to the database again, and performance gets much better. It's ONLY on the merge step that the UI tries to do it itself vs. using the database.

I sure wish Microsoft would fix these issues when hitting SQL server databases:

1. This issue

2. The table/view navigation screen not being able to show synonyms. I can select tables or views, synonyms should show up also and work exactly the same

3. The inability to use a SQL override as the source step, but then being able to add additional steps that still get pushed down (for example, filtering, aggregation, etc.). If it would just submit the query, substituting the SQL in "as if" it were a table name, everything would work fine - and would remove the huge bottleneck that SQL overrides stop getting pushed down

Sorry I don't have better answers...I've been pounding on these with Microsoft for 6+ months at this point but unfortunately don't see any progress being made.

Thanks,

Scott

v-lili6-msft · ‎12-13-2018

hi, @Anonymous

here are some suggestions for you refer to

1. When you connect to the SQL Server you can expand the "Advanced options" and input a SQL query to pull in the data that you want.

https://community.powerbi.com/t5/Desktop/Custom-SQL-instead-of-table-or-view/td-p/96463

2. Use directquery instead of import, do these steps in SQL

https://docs.microsoft.com/en-us/power-bi/power-bi-reports-performance

3. "Removed Duplicates" as this blog

https://blog.crossjoin.co.uk/2018/03/16/improving-the-performance-of-aggregation-after-a-merge-in-power-bi-and-excel-power-query-gettransform/

Best Regards,

Lin

Community Support Team _ Lin
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Anonymous · ‎12-19-2018

Hi Lin,

I actually tried the ideas in Chris Webb's blog already - "tricking" Power Query into knowing what the primary key was didn't help performance at all.

Note that the merge isn't a problem "after" the columns from the 2nd table have been expanded, but the step that actually does the merge itself (BEFORE you can expand the columns) is. Power Query seems to import the entire tables for some reason...no idea why.

Also, we're trying not to ask end users to have to write SQL - one because some of them can't do that...but also because if you use a SQL statement but then have to do any processing in Power Query to do anything else, it can't push down, causing huge performance issues.

I have not tried the "use direct query" option yet - we import everything to get the performance of in-memory. But maybe I could start with direct queries and then flip them to import - I'll give these a try and see if, in direct query mode, this issue goes away. It wouldn't be a perfect solution (for example, if we needed to later change the ETL then there is no way to go from "import" back to "direct query" - but at least it would be something.

Thank you!

Scott

Anonymous · ‎12-19-2018

p.s. when I say "Note that the merge isn't a problem "after" the columns from the 2nd table have been expanded, but the step that actually does the merge itself (BEFORE you can expand the columns) is. Power Query seems to import the entire tables for some reason...no idea why"

what I mean is - I don't understand why power query has to pull in the entire tables...if it just needs to list the columns so you can use the "expand columns" function, then it should (at most) only need to bring back a single row. Actually it "shouldn't" need to do this - Power Query already has the list of columns from the merged table before it even starts the merge. But if for some reason this step needs to show them, then instead of joining locally in PBI (which it seems to be doing), it should do a "select top 1 * from a join b" type of thing.

I'd be happy to do a webex to demonstrate the exact issue if that would help - we'd love to get this resolved. We've got users trying to join tables with tens to hundreds of millions of rows in...and this bug grinds development to a halt.

Thanks!

Scott

PowerQuery: Merge Queries HORRIBLE performance

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly