cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
cathoms
Helper V
Helper V

Remove duplicate rows from a new merged query

Hello! I have two Excel worksheets that I am trying to join into a single query/dataset and I am getting duplicate rows that I can't seem to remove. I've tried this is both Power Query Excel and Power BI.

 

I have data for healthcare provider office visit coding. The goal is to compare the rates at which individual providers are utilizing particular codes with the average for their group and with their own use in the past. There are two types of office visits - new and established (refering to the patients). I tend to start building data models small and then scale up. Sometimes I use Power Query in Excel first, then copy queries into PBI. This isn't efficient but I am still learning.  Anywho, I started with Excel workbooks containing calendar year 2021 quarter one data. One workbook for established visits and one for new visits. In Power Query I removed all the columns I didn't need, etc. from each and then merged them into a new query. When I ExpandTableColumn I just keep the total new visits and freq of each visit type so I end up with columns for provider ID, name, speciality, group, and number of total visits and visits by type. There are 696 rows for the "Established" query, 611 rows in the "New" query, and 696 rows in the merged "All_Visits" query. Great!

 

Then I started again, this time adding CY 2020, Q1 data into new source files. Same columns, just new rows except I added a new "Period" column to distinguish the time periods. I repeated the same steps as previously but now I end up with 1406 rows in "Established", 1169 rows in "New", and 2300 in "All_Visits". I 'removed duplicates' in all three queries but I still end up with numerous duplicates. What am I doing wrong or what am I missing here?

 

You can see the duplicate rows here:

cathoms_0-1620397458048.png

cathoms_1-1620397557135.png

 

As you can see, I have four rows for many providers - two for each time period. In theory I should have only two rows for each provider, one for 2020_Q1 and one for 2021_Q1.

1 ACCEPTED SOLUTION
HotChilli
Super User
Super User

Let's take 13-14 as an example. There are different values for 'Total New Office Visits' and '99203 Freq' therefore the rows are not duplicates.

-------------

Other pairs also have differences. 

 

Either :-

1 of the rows in each case is correct (in which case you need to figure out which one),

they're both wrong,

they're both right .

-----------

It's possible that the Merge Queries is wrong but I don't know your data so can't say.

View solution in original post

8 REPLIES 8
HotChilli
Super User
Super User

Don't be so hard on yourself.

You're on your way to getting this fixed so that's a positive.  Good luck.

HotChilli
Super User
Super User

Let's take 13-14 as an example. There are different values for 'Total New Office Visits' and '99203 Freq' therefore the rows are not duplicates.

-------------

Other pairs also have differences. 

 

Either :-

1 of the rows in each case is correct (in which case you need to figure out which one),

they're both wrong,

they're both right .

-----------

It's possible that the Merge Queries is wrong but I don't know your data so can't say.

Well I feel stupid...

 

Yep, I see the differences now - I was focusing on the duplication of the period and provider combination. Now I need to figure out why I'm getting those differences.

 

Thanks.

HotChilli
Super User
Super User

Hello,

Stephen Tao is asking you to clarify what happens to the data after the 'Remove duplicates' step. Is it the same? The pictures shown above have the focus on the "Expanded.." step so we're seeing it before duplicates have been removed.

-------

My question is related but different.  What rows do you think are duplicated from the picture because I don't see many.  Also, are there any other columns that are not visible?

Ah. Got it. Yes, it looks the same.

 

To answer your question, in my images, of the 14 rows displyed, only 9 & 10 are not duplicated. So, 1-2 are duplicates, 3-4 are duplicates, 5-6, are duplicates, 7-8 are duplicates, 11-12 are duplicates, and 13-14 are duplicates. Am I missing something that suggests thos are not duplicates?

 

There are two columns not displyaed - 99204 Freq and 99205 Freq.

v-stephen-msft
Community Support
Community Support

Hi @cathoms ,

 

I'm sorry I may have asked a stupid question. Will there be any duplicate values after you switch to the step of removing duplicate values?

cathoms_0-1620397458048.png

 

 

Best Regards,

Stephen Tao

 

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

I'm sorry, I don't think I follow. I used the "remove rows" drop down --> "remove duplicate rows". 

cathoms
Helper V
Helper V

Would it be better to keep the different time period data in separate source files, then merge them in Power Query?

Helpful resources

Announcements
September Update

Check it Out!

Click here to learn more about the September 2022 updates!

Power BI Show episode 9

The Power BI Community Show

Watch the playback when Priya Sathy and Charles Webb discuss Datamarts! Kelly also shares Power BI Community updates.

Power BI Dev Camp Session 25

Ted's Dev Camp - August 25, 2022

Watch Session 25 of Ted's Dev Camp.

Top Kudoed Authors