Solved: Re: Power Query Editor - Remove Duplicates (select...

AliRade · ‎09-18-2021

Hi Community,

I have a dataset where I want to remove 'selective duplicates'. I've tried it by sorting the column order and then removing duplicates (via a Buffer table) but that removes all Duplicates (even though i want to keep some).

Example dataset:

ITEM	ID	Cust	Date	Level
1	A200	A1	16/08/2021	2
1	A200	A1	16/07/2021	1
2	A300	B1	15/09/2021	2
2	A300	B2	15/09/2021	2
2	A300	B1	13/07/2021	1
3	A400	A1	18/01/2021	1
4	A500	A1	14/07/2021	3
4	A500	B1	14/07/2021	3
4	A500	B2	14/07/2021	3
4	A500	B2	13/06/2021	2
4	A500	B1	12/06/2021	2
4	A500	C1	14/07/2021	3
4	A500	C1	12/06/2021	2
4	A500	A1	01/04/2021	1

The Output I'm trying to get from the above Dataset is that for EACH "ID", output (or keep) the rows which has the latest "Level" number (even if the latest Level number for a particular "ID" is appearing more than once). So wanting the output table to look like below.

ITEM	ID	Cust	Date	Level
1	A200	A1	16/08/2021	2
2	A300	B1	15/09/2021	2
2	A300	B2	15/09/2021	2
3	A400	A1	18/01/2021	1
4	A500	A1	14/07/2021	3
4	A500	B1	14/07/2021	3
4	A500	B2	14/07/2021	3
4	A500	C1	14/07/2021	3

What I've done is a sort (descending) on Colum "Level" and then remove Duplicate Rows Column "ID". But this way removes all duplicates in "ID" even though I would like to keep some (for example for ID=A500, i want to keep all rows that have the latest Level).

Any ideas on how to go about this? I've tried looking into 'grouping' but not getting the output i want.

Thanks in advance for any suggestions.

wdx223_Daniel · ‎09-19-2021

View solution in original post

jennratten · ‎09-18-2021

Here is how you can accomplish this. I have added comments in the script below that explains what's going on.

Add a column with the date formatted to m/d/yyyy.
Group rows by Item, Id, Cust with a new column for the max date and all rows as nested tables.
Add a new column that selects the value of the Level column in the nested table, for records whose formatted date equals the max date.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdBJCsAgDADAv3gWjHGpPVafIf7/GzWBQqylenLJkK1WZZVWFwLQQXcbDSSDgPRA1fQ3OR5imSDFHJPMv8HAOWaRBNeEs7ipkCPiRS/JgB2Jp1gQxIss7k3yBsFN0tuN40RTIfwjZd1LWWfhoftSwIu9tBs=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ITEM = _t, ID = _t, Cust = _t, Date = _t, Level = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ITEM", Int64.Type}, {"ID", type text}, {"Cust", type text}, {"Date", type text}, {"Level", Int64.Type}}),
    // Add a column with the date formatted to m/d/yyyy.
    // You could transform the existing date column instead if desired, 
    // but adding as a new column allows you to retain the existing format in the final table.
    FormattedDate = Table.AddColumn(#"Changed Type", "Formatted Date", each Date.From ( Text.Combine({Text.Middle([Date], 4, 2), Text.Start([Date], 2), Text.Middle([Date], 5)}) ), type date),
    Grouped = Table.Group(FormattedDate, {"ITEM", "ID", "Cust"}, {{"Max Date", each List.Max([Formatted Date]), type date}, {"All", each _, type table [ITEM=nullable number, ID=nullable text, Cust=nullable text, Date=nullable text, Level=nullable number, Formatted Date=date]}}),
    AddLevel = Table.AddColumn ( Grouped, "Level", each 
        // Create variables
        let 
            // max date of each row in the Grouped table
            varMaxDate = [Max Date],
            // Nested table filtered 
            FilterGroupedTables = Table.SelectRows ( [All], each varMaxDate = _[Formatted Date] )
        in 
        // Return the value of the Level column in the nested table.
            FilterGroupedTables[Level]{0}, Int64.Type
    ),
    RemoveHelperColumns = Table.RemoveColumns(AddLevel,{"Max Date", "All"})
in
    RemoveHelperColumns

AliRade · ‎09-19-2021

Thanks for the detailed solution @jennratten . This worked perfectly.

However, I'm trying to avoid using 'Date' as some of the dates are null, and some of the latest 'Level' have an older date (sorting by max date would then not be an accurate indicator).

Ideally, I want to only group by Cust and Level. This will then output all IDs that have the latest 'Level' regardless of taking 'Date' into account.

Example, added Cust "D1"

ITEM	ID	Cust	Date	Level
1	A200	A1	16/08/2021	2
2	A300	B1	15/09/2021	2
2	A300	B2	15/09/2021	2
3	A400	A1	18/01/2021	1
4	A500	A1	14/07/2021	3
4	A500	B1	14/07/2021	3
4	A500	B2	14/07/2021	3
4	A500	C1	14/07/2021	3
4	A600	D1	13/06/2021	3

wdx223_Daniel · ‎09-19-2021

AliRade · ‎09-20-2021

Thanks, worked as desired, and without using the 'date' data. 👍

Jakinta · ‎09-18-2021

You can try this.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdBJCsAgDADAv3gWjHGpPVafIf7/GzWBQqylenLJkK1WZZVWFwLQQXcbDSSDgPRA1fQ3OR5imSDFHJPMv8HAOWaRBNeEs7ipkCPiRS/JgB2Jp1gQxIss7k3yBsFN0tuN40RTIfwjZd1LWWfhoftSwIu9tBs=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ITEM = _t, ID = _t, Cust = _t, Date = _t, Level = _t]),
    ChangedTypeLocale = Table.TransformColumnTypes(Source, {{"Date", type date}}, "hr-HR"),
    ChangedType = Table.TransformColumnTypes(ChangedTypeLocale,{{"Level", type number}}),
    Grouped = Table.Group(ChangedType, {"ID", "Cust"}, {{"Gr", each Table.FirstN(Table.Sort(_, {{"Date", Order.Descending}, {"Level", Order.Descending}}), 1), type table [ITEM=nullable text, ID=nullable text, Cust=nullable text, Date=nullable date, Level=nullable number]}}),
    Expanded = Table.ExpandTableColumn(Grouped, "Gr", {"Date", "Level"})
in
    Expanded

AliRade · ‎09-19-2021

Thanks Jakinta 🙂

This gives me the desired output.

However I'm wanting to avoid using 'Date' as one of the sorting fields (because some of the dates are null or older dates for latest level).

Will try to modify this so that only grouping by Cust and Level

Power Query Editor - Remove Duplicates (selective column)

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024