qsmith83
Responsive Resident

How to hide/remove duplicates based on condition

Hi, how do I hide/remove duplicate rows based on the following condition:

  • if the [JobID+Ticket] column contains duplicates, keep the row with the highest [Sequence] number and remove the other duplicate rows. Screenshot example below - keep Sequence "4676003" and delete the lower sequences above it.

Screenshot 2021-11-03 113154.jpg

 


14 REPLIES
Belin
Regular Visitor

As per my understanding, the "Remove Duplicates" function works from top to bottom, so all you need to do is:

- append the new data to the old data (you will now have duplicates in your key column)

- sort by the column you care about (for example, if you need the most recent record, sort descending by update date; if you need the highest sequence number, sort descending by sequence). The important thing is that the record you want to keep sits on a higher row than the ones you want to remove

- now right-click the key column and remove duplicates (a minimal sketch of these steps in M follows below)
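
For reference, here is a rough sketch of that append-sort-dedupe idea in Power Query M. The query names OldData and NewData are placeholders, and the [JobID+Ticket] / [SEQUENCE] column names are taken from this thread, so adjust them to your actual queries and columns:

let
    // OldData and NewData stand in for your existing queries; both are assumed
    // to have the same columns, including the [JobID+Ticket] key and SEQUENCE.
    Appended = Table.Combine({OldData, NewData}),
    // Sort so the row to keep (highest SEQUENCE) is the first row for each key.
    Sorted = Table.Sort(Appended, {{"SEQUENCE", Order.Descending}}),
    // Buffering pins the sorted order; without it, Remove Duplicates is not
    // guaranteed to keep the top row of each group.
    Buffered = Table.Buffer(Sorted),
    // Keep only the first (highest-SEQUENCE) row per JobID+Ticket.
    Deduped = Table.Distinct(Buffered, {"JobID+Ticket"})
in
    Deduped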

 

Sometimes the easiest solution is the best 🙂

AlexisOlson
Super User

@AlexisOlson, thanks for sharing your post. It's a beast! Which method would you recommend for my case? My data source imports spreadsheets from SharePoint, with each file containing an average of ~300k rows, and the current table is at 3M+ rows.

qsmith83
Responsive Resident

I tried option 4 from your blog post but I'm getting an error "after merging with the step before the grouping". @BA_Pete, @mahoneypat, any suggestions?

Appreciate any help please.

Link to the file if you need it:

https://drive.google.com/drive/folders/1dmmbhdbzySZBr8oTxj9CWsmztwFQ3f8A?usp=sharing

Screenshot 2021-11-03 205126.jpg
Screenshot 2021-11-03 205145.jpg

I don't think query folding is relevant here, so I'd probably go with approach #2 or #3.

I tested #3 (the Table.Max(_, "Col") version) on a subset of your data loaded into an Excel file and it seems to work OK.

 

You could also try including the logic in your Transform File function like this (not tested):

let
    Source = (Parameter2 as binary) => let
    Source = Csv.Document(Parameter2,[Delimiter=",", Columns=66, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Grouped Rows" = Table.Group(#"Promoted Headers", {"JobID+Ticket"}, {{"Tbl", each Table.Max(_, "SEQUENCE"), type record}}),
    ColNames = List.RemoveItems(Table.ColumnNames(#"Promoted Headers"), {"JobID+Ticket"}),
    #"Expand Records" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", ColNames, ColNames)
in
    #"Expand Records"
in
    Source

 

Hi @AlexisOlson, I inserted your logic in the 'Transform File' function with no issue, but when I clicked on the actual table, the following error appeared...

FYI, the [JobID+Ticket] column is a manual merge I created in the past.

Screenshot 2021-11-04 082316.jpg
Screenshot 2021-11-04 082430.jpg
Details: "JobID+Ticket"

OK. Then you can either do the manual merge in the transform before referencing it, or else do all of the logic in the big query rather than the function component.

Hi @AlexisOlson, I tried option 3 by doing all of the logic in the big query, but it's still processing the first "Group By" step after more than 30 minutes and still going...

Can you please provide screenshots of the code and the exact steps using the function component method so we can save time?

I tried manually adding the merge step in the function component but kept getting the same error, i.e. "[JobID+Ticket] column not found"...

You've got a big dataset from a non-database source. I wouldn't expect it to be very fast regardless of how good the approach is, since you have to do it all locally in memory rather than being able to push the work back to a server that's good at that kind of thing. Hopefully it at least gives you progress details when loading.

 

If you're getting [JobID+Ticket] not found, that means you didn't create that column before referencing it, or you're referencing the wrong step. I think you could insert that merge into the Transform File (2) function like this:

let
    Source = (Parameter2 as binary) => let
    Source = Csv.Document(Parameter2,[Delimiter=",", Columns=66, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Merge Columns" = Table.AddColumn(#"Promoted Headers", "JobID+Ticket", each Text.Combine({[JOB_ID], [TICKET_NBR]}, ""), type text),
    #"Grouped Rows" = Table.Group(#"Merge Columns", {"JobID+Ticket"}, {{"Tbl", each Table.Max(_, "SEQUENCE"), type record}}),
    ColNames = List.RemoveItems(Table.ColumnNames(#"Merge Columns"), {"JobID+Ticket"}),
    #"Expand Records" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", ColNames, ColNames)
in
    #"Expand Records"
in
    Source

I can't easily test this on my end since I don't have valid credentials to connect to your SharePoint site, so I don't get any previews of your file in the query editor.

Hi @AlexisOlson, that makes sense, and thanks for clarifying. The data source will be transferred to a database in the near future.

OK, so I tried your latest code and it worked without issue; however, when testing on the front end, the duplicates were still present. Did I miss something in the process?

FYI, I've added a new folder to the Google Drive link I shared with you that contains a couple of spreadsheets of the raw data, in case you want to replicate it from your end.

Sorry for taking up your time, appreciate your patience.

Screenshot 2021-11-04 121842.jpg

I'm guessing that it eliminated duplicates for each file individually but there were duplicates across files. I guess you do have to de-duplicate after combining the files then.

 

In this case, you use the same code (with any step references adjusted, e.g. #"Merge Columns" --> #"Inserted Merged Column") but put it at the end of the FreightForward v2 query rather than the Transform Files (2) function.
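
Not tested against your actual files, but roughly the appended logic could look like this. I've written it as a separate query that references FreightForward v2 just so the example is self-contained; the same steps can equally be pasted onto the end of FreightForward v2 itself (with the Source reference swapped for your last step name). Step and column names are taken from earlier in the thread, so adjust them if yours differ:

let
    // Reference the combined query; by this point the files have been appended
    // and the [JobID+Ticket] merged column exists.
    Source = #"FreightForward v2",
    // One group per JobID+Ticket; Table.Max returns the record of the row with
    // the highest SEQUENCE in each group (SEQUENCE should be a number type,
    // otherwise the comparison is alphabetical).
    #"Grouped Rows" = Table.Group(Source, {"JobID+Ticket"}, {{"Tbl", each Table.Max(_, "SEQUENCE"), type record}}),
    // Expand every other column back out of the kept record.
    ColNames = List.RemoveItems(Table.ColumnNames(Source), {"JobID+Ticket"}),
    #"Expand Records" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", ColNames, ColNames)
in
    #"Expand Records"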

 

If you're going to use a database eventually, you might want to try approach #4 from my blog post.

OK, I might try your suggestion in the interim, as I've just been informed by a colleague that they will proceed with migrating the data source to a database in the coming days, which will mitigate the duplicate issue on my end. Thanks again for your help.

Hi @qsmith83 ,

 

It looks like @AlexisOlson  is already helping you with this. You're in excellent hands. 👍

 

Pete







qsmith83
Responsive Resident

@BA_Pete, no prob, I'll wait for Alexis' response. Thanks :)
