Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
blytonpereira
Helper II
Helper II

Alternative to Merge very large tables

I have 2 tables A and B each with about 500,000 rows of data.

 

I would like to merge the two Tables A and B based on the column called ID.

Since each of the datasets have duplicate values, I created a Bridge table that contains unique values of ID from both Table A and B and appended them.

 

I then connect this Bridge table to the Table A and to Table B via a one to many filter.

However in my Table A there are some ID that are not present in Table B and ....there are some ID present in Table B and not in Table A.

Hence when I drag the ID column from my Bridge Table to my workspace in a Table Visualization, there is usually a row that is blank as its related to these missing ID from either Table A or B.

 

I tried to merge the two tables A and B to only keep ID that are present in both Tables, however this is painfully slow. Can anyone suggest an alternative ?

I would like to somehow only keep ID in Table A and B that are present in both table and filter out all other ones.

 

Thanks

 

1 ACCEPTED SOLUTION

Alright, then you do the following:

* Make a duplicate of your Table 1

* Remove all columns apart from your ID column

* Remove duplicates from that column

* Merge the column with the ID column in Table 2.

* Remove any rows that are blank due to the merge not finding a match. 

* Repeat this but merge Table 2 into Table 1 instead.

This should result in you having only ID's that exist in both of the tables.

Hope it works!


Connect on LinkedIn

View solution in original post

7 REPLIES 7
tex628
Community Champion
Community Champion

You need to create a unique key in either table A and B which you can use for the merge. If this isn't possible you can place a report filter on both table A and B to remove blank rows.


Connect on LinkedIn

@tex628When you say create a unique key, I am not clear - do you mean like a key such as that each row in the dataset is unique and can be identified and then do the same for the Table B ....and then perform the merge on the Key ?

Yes, a key that becomes unique in atleast one of the tables. 


Example:

DocumentID                   ProductID                        Key
HTYD2457                       Screw-013                      HTYD2457Screw-013


You can see if the column is distinct by comparing rows with distinct rows in the table view:

image.pngBottom leftBottom left
You of course need to have identical keys in both tables for this to work. If all rows are distinct you have a unique column.


Connect on LinkedIn

@tex628ok thank you unfortunately this is not possible in my tables because of the way the data is present.

 

I need to find another solution to somehow only keep the rows in each of the tables when the ID is present in both of the tables. I tried also creating a query of all of the unique ID values from Table A and then another query of the unique ID values from Table B. I then do a merge back to my original tables to delete the missing IDS, but this is extremely slow since the dataset is very big 500K rows

Alright, then you do the following:

* Make a duplicate of your Table 1

* Remove all columns apart from your ID column

* Remove duplicates from that column

* Merge the column with the ID column in Table 2.

* Remove any rows that are blank due to the merge not finding a match. 

* Repeat this but merge Table 2 into Table 1 instead.

This should result in you having only ID's that exist in both of the tables.

Hope it works!


Connect on LinkedIn

Instead of duplication is there any difference of taking only the requiored column and ADD AS NEW QUERY....or also REFERENCE...

 

Is there any performance differences ? In relating the original tables will be replaced with new data so I think the Duplicate query will need to be rerun manually again when this does happen ?

If you duplicate the correct query any changes to the original query will also affect the duplicate so it should work, but i think your solution is accually better performance wise. Add the column as a new query instead!


Connect on LinkedIn

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.