Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
fiveone
Helper II
Helper II

How to remove duplicates based on sort order

When loading the data I need to keep only the latest record.

I don't have a date to rely on to determine the latest record but I have a status, based on which I can sort in a somewhat chronological order.

 I thought I could

  • Associate a statusID to each status, making sure that the lowest the ID the latest the event
  • Sort by statusID
  • Remove duplicates

Unfortunately it seems the sort order is not always respected,

I read that we should use the Table.Buffer() function before removing duplicates but that this could create performance issue.

It does indeed create performance issue. It works on a small scale but when I use this on the entire dataset the load process does not complete.

 

Now I’m looking into using Table.SelectRows with a condition that would only return the row with the smallest statusID for each RequestID.

Something along the line of

  

 #"Latest Request" = Table.SelectRows(#"Previous Step", each (List.Max([RequestID]," statusID "))),

 

But that does return a type incompatibility error

5 REPLIES 5
Anonymous
Not applicable

We also had a similar problem. We found this solved our problem:

 

1. Sort data as required

2. Add index column starting at 1

3. Remove Duplicates

 

The index column helps Power BI realise while removing duplicates that there is a specific order to the sorted data.

acstaffa
Frequent Visitor

I have already uploaded a SQL data, how can I create a function to sort the column order and force de removing to follow the date sort? I dont know how to create the function...

 

It is sad PBI dont "follow" the sort order to eliminate duplicates...

MarcelBeug
Community Champion
Community Champion

My suggestion would be the following, illustrated by this video.


I created some test data in Access just to verify if the code would allow for Query Folding (by rightclicking each step in the queries and check for the existence of Native Query). They all did, so performance should be OK..

 

I created a query InputData that just reads the data from Access. For this query, load is disabled.

 

let
    Source = Access.Database(File.Contents("C:\Users\Marcel\Documents\Forum bijdragen\Power BI Community\How to remove duplicates based on sort order.accdb"), [CreateNavigationProperties=true]),
    _Requests = Source{[Schema="",Item="Requests"]}[Data]
in
    _Requests

 

 

I created a query MinStatusByRequestID with input from query InputData that determines the minimum statusID for each RequestID using Group By on the Transform tab. Also for this query, load is disabled.

 

let
    Source = InputData,
    #"Grouped Rows" = Table.Group(Source, {"RequestID"}, {{"minStatus", each List.Min([statusID]), type number}})
in
    #"Grouped Rows"

 

 

I created a query FilteredData that merges the previous 2 queries with an inner join. For this query, load is enabled.

 

let
    Source = InputData,
    #"Merged Queries" = Table.Join(Source,{"RequestID", "statusID"},MinStatusByRequestID,{"RequestID", "minStatus"},JoinKind.Inner),
    #"Removed Columns" = Table.RemoveColumns(#"Merged Queries",{"minStatus"})
in
    #"Removed Columns"

 

Specializing in Power Query Formula Language (M)

 Thank you so much Marcel! I have been struggling for days and googling crazy for a solution and you just solved my problem!!!

Wow, thank you Marcel for taking the time to provide that detailed explanation and even making a video!

Leveraging grouping and merging is brilliant.

 

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.