Anonymous
Not applicable

How to hide/remove duplicates based on condition

Hi, how do I hide/remove duplicate rows based on the following condition:

  • If the [JobID+Ticket] column contains duplicates, keep the row with the highest [Sequence] number and remove the other duplicate rows. Screenshot example below: keep Sequence "4676003" and delete the lower sequences above it.

Screenshot 2021-11-03 113154.jpg

 


14 REPLIES
Belin
Frequent Visitor

As per my understanding, the "Remove Duplicates" function works from top to bottom, so all you need to do is the following (there's a quick M sketch after the list):

- Append the new data to the old data (you will now have duplicates in your key column).

- Sort the rows by the column you need (for example, if you need the most recent record, sort descending by update date; if you need the highest sequence number, sort descending by that number). The important thing is that the record you want to keep sits on a higher row than the ones you want to remove.

- Now right-click on the key column and remove duplicates.
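A minimal sketch of those three steps in M; the table and column names here (Appended, "JobID+Ticket", "SEQUENCE") are placeholders for your own query, and the Table.Buffer step is an extra precaution I've added so the duplicate removal respects the sort order:

let
    // sort so the row you want to keep (highest SEQUENCE) comes first
    #"Sorted Rows" = Table.Sort(Appended, {{"SEQUENCE", Order.Descending}}),
    // buffer the sorted table so Table.Distinct keeps the topmost row per key
    #"Buffered" = Table.Buffer(#"Sorted Rows"),
    #"Removed Duplicates" = Table.Distinct(#"Buffered", {"JobID+Ticket"})
in
    #"Removed Duplicates"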

 

Sometimes the easiest solution is the best 🙂

AlexisOlson
Super User

Anonymous
Not applicable

@AlexisOlson, thanks for sharing your post. It's a beast! Which method would you recommend for my case? My data source imports spreadsheets from SharePoint, with each file containing an average of ~300k rows and the current table at 3M+ rows.

Anonymous
Not applicable

I tried option 4 from your blog post, but I'm getting an error after merging with the step before the grouping. @BA_Pete, @mahoneypat, any suggestions?

Appreciate any help please.

Link to the file if you need it:

https://drive.google.com/drive/folders/1dmmbhdbzySZBr8oTxj9CWsmztwFQ3f8A?usp=sharing

Screenshot 2021-11-03 205126.jpg
Screenshot 2021-11-03 205145.jpg

I don't think query folding is relevant here, so I'd probably go with approach #2 or #3.

I tested #3 (the Table.Max(_, "Col") version) on a subset of your data loaded into an Excel file and it seems to work OK.
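For anyone following along, Table.Max(table, "SEQUENCE") returns the whole row (as a record) with the largest value in the SEQUENCE column, so grouping on "JobID+Ticket" and applying it to each group keeps exactly one row per key. A tiny, made-up illustration (not the real data):

let
    Sample = Table.FromRecords({
        [#"JobID+Ticket" = "A1", SEQUENCE = 1, Qty = 10],
        [#"JobID+Ticket" = "A1", SEQUENCE = 3, Qty = 12]
    }),
    // returns the record with SEQUENCE = 3
    TopRow = Table.Max(Sample, "SEQUENCE")
in
    TopRow

One caveat: if SEQUENCE comes through as text rather than a number, the comparison is lexicographic, so it's worth setting the column type first.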

 

You could also try including the logic in your Transform File function like this (not tested):

let
    Source = (Parameter2 as binary) => let
    Source = Csv.Document(Parameter2,[Delimiter=",", Columns=66, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    // keep only the row with the highest SEQUENCE for each JobID+Ticket key
    #"Grouped Rows" = Table.Group(#"Promoted Headers", {"JobID+Ticket"}, {{"Tbl", each Table.Max(_, "SEQUENCE"), type record}}),
    // expand the kept row back into regular columns (the key column is already present)
    ColNames = List.RemoveItems(Table.ColumnNames(#"Promoted Headers"), {"JobID+Ticket"}),
    #"Expand Records" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", ColNames, ColNames)
in
    #"Expand Records"
in
    Source

 

Anonymous
Not applicable

Hi @AlexisOlson, I inserted your logic in the 'Transform File' function with no issue, but then when I clicked on the actual table, the following error appeared...

FYI, the [JobID+Ticket] column is a manual merge I created in the past.

Screenshot 2021-11-04 082316.jpg
Screenshot 2021-11-04 082430.jpg
Details: JobID+Ticket

OK. Then you can either do the manual merge in the transform before referencing it or else do all of the logic in the big query rather than the function component.

Anonymous
Not applicable

Hi @AlexisOlson, I tried option 3 by doing all of the logic in the big query, but it's still processing the first "Group By" step after more than 30 minutes and still going...

Can you please provide screenshots of the code and the exact steps for the function component method so we can save time?

I tried manually adding the merge step in the function component but kept getting the same error, i.e. "[JobID+Ticket] column not found"...

You've got a big dataset from a non-database source. I wouldn't expect it to be very fast regardless of how good the approach is, since you have to do it all locally in memory rather than being able to push it back to a server that's good at that kind of thing. Hopefully, it at least gives you progress details while loading.

 

If you're getting [JobID+Ticket] not found, that means you didn't create that column before referencing it or are referencing the wrong step. I think you could insert that merge into the Transform File (2) function like this:

let
    Source = (Parameter2 as binary) => let
    Source = Csv.Document(Parameter2,[Delimiter=",", Columns=66, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    // build the JobID+Ticket key inside the function so the grouping step can find it
    #"Merge Columns" = Table.AddColumn(#"Promoted Headers", "JobID+Ticket", each Text.Combine({[JOB_ID], [TICKET_NBR]}, ""), type text),
    #"Grouped Rows" = Table.Group(#"Merge Columns", {"JobID+Ticket"}, {{"Tbl", each Table.Max(_, "SEQUENCE"), type record}}),
    ColNames = List.RemoveItems(Table.ColumnNames(#"Merge Columns"), {"JobID+Ticket"}),
    #"Expand Records" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", ColNames, ColNames)
in
    #"Expand Records"
in
    Source

I can't easily test this on my end since I don't have valid credentials to connect to your SharePoint site, so I don't get any previews of your file in the query editor.

Anonymous
Not applicable

Hi @AlexisOlson, that makes sense, and thanks for clarifying. The data source will be transferred to a database in the near future.

OK, so I tried your latest code and it ran without issue; however, when doing a test on the front end, the duplicates were still present. Did I miss something in the process?

FYI, I've added a new folder to the Google Drive link I shared with you; it contains a couple of spreadsheets of the raw data if you want to replicate this on your end.

Sorry for taking up your time; I appreciate your patience.

Screenshot 2021-11-04 121842.jpg

I'm guessing that it eliminated duplicates for each file individually but there were duplicates across files. I guess you do have to de-duplicate after combining the files then.

 

In this case, you use the same code (with any step references adjusted, e.g. #"Merge Columns" --> #"Inserted Merged Column") but put it at the end of the FreightForward v2 query rather than the Transform Files (2) function.
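For illustration, a sketch of what that looks like at the end of the combined query; the step name #"Inserted Merged Column" comes from the note above, while everything else (including where your query currently ends) is an assumption:

// appended after the step that creates the merged key column in FreightForward v2
#"Grouped Rows" = Table.Group(#"Inserted Merged Column", {"JobID+Ticket"}, {{"Tbl", each Table.Max(_, "SEQUENCE"), type record}}),
ColNames = List.RemoveItems(Table.ColumnNames(#"Inserted Merged Column"), {"JobID+Ticket"}),
#"Expand Records" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", ColNames, ColNames)
// ...and the query's final "in" should then return #"Expand Records"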

 

If you're going to use a database eventually, you might want to try approach #4 from my blog post.

Anonymous
Not applicable

OK, I might try your suggestion in the interim, as I've just been informed by a colleague that they will proceed with migrating the data source to a database in the coming days, which will mitigate the duplicate issue on my end. Thanks again for your help.

Hi @Anonymous ,

 

It looks like @AlexisOlson is already helping you with this. You're in excellent hands. 👍

 

Pete



Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!




Anonymous
Not applicable

@BA_Pete, no problem, I'll wait for Alexis's response. Thanks :)
