Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Hi Community,
I have a dataset where I want to remove 'selective duplicates'. I've tried it by sorting the column order and then removing duplicates (via a Buffer table) but that removes all Duplicates (even though i want to keep some).
Example dataset:
ITEM | ID | Cust | Date | Level |
1 | A200 | A1 | 16/08/2021 | 2 |
1 | A200 | A1 | 16/07/2021 | 1 |
2 | A300 | B1 | 15/09/2021 | 2 |
2 | A300 | B2 | 15/09/2021 | 2 |
2 | A300 | B1 | 13/07/2021 | 1 |
3 | A400 | A1 | 18/01/2021 | 1 |
4 | A500 | A1 | 14/07/2021 | 3 |
4 | A500 | B1 | 14/07/2021 | 3 |
4 | A500 | B2 | 14/07/2021 | 3 |
4 | A500 | B2 | 13/06/2021 | 2 |
4 | A500 | B1 | 12/06/2021 | 2 |
4 | A500 | C1 | 14/07/2021 | 3 |
4 | A500 | C1 | 12/06/2021 | 2 |
4 | A500 | A1 | 01/04/2021 | 1 |
The Output I'm trying to get from the above Dataset is that for EACH "ID", output (or keep) the rows which has the latest "Level" number (even if the latest Level number for a particular "ID" is appearing more than once). So wanting the output table to look like below.
ITEM | ID | Cust | Date | Level |
1 | A200 | A1 | 16/08/2021 | 2 |
2 | A300 | B1 | 15/09/2021 | 2 |
2 | A300 | B2 | 15/09/2021 | 2 |
3 | A400 | A1 | 18/01/2021 | 1 |
4 | A500 | A1 | 14/07/2021 | 3 |
4 | A500 | B1 | 14/07/2021 | 3 |
4 | A500 | B2 | 14/07/2021 | 3 |
4 | A500 | C1 | 14/07/2021 | 3 |
What I've done is a sort (descending) on Colum "Level" and then remove Duplicate Rows Column "ID". But this way removes all duplicates in "ID" even though I would like to keep some (for example for ID=A500, i want to keep all rows that have the latest Level).
Any ideas on how to go about this? I've tried looking into 'grouping' but not getting the output i want.
Thanks in advance for any suggestions.
Solved! Go to Solution.
Here is how you can accomplish this. I have added comments in the script below that explains what's going on.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdBJCsAgDADAv3gWjHGpPVafIf7/GzWBQqylenLJkK1WZZVWFwLQQXcbDSSDgPRA1fQ3OR5imSDFHJPMv8HAOWaRBNeEs7ipkCPiRS/JgB2Jp1gQxIss7k3yBsFN0tuN40RTIfwjZd1LWWfhoftSwIu9tBs=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ITEM = _t, ID = _t, Cust = _t, Date = _t, Level = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"ITEM", Int64.Type}, {"ID", type text}, {"Cust", type text}, {"Date", type text}, {"Level", Int64.Type}}),
// Add a column with the date formatted to m/d/yyyy.
// You could transform the existing date column instead if desired,
// but adding as a new column allows you to retain the existing format in the final table.
FormattedDate = Table.AddColumn(#"Changed Type", "Formatted Date", each Date.From ( Text.Combine({Text.Middle([Date], 4, 2), Text.Start([Date], 2), Text.Middle([Date], 5)}) ), type date),
Grouped = Table.Group(FormattedDate, {"ITEM", "ID", "Cust"}, {{"Max Date", each List.Max([Formatted Date]), type date}, {"All", each _, type table [ITEM=nullable number, ID=nullable text, Cust=nullable text, Date=nullable text, Level=nullable number, Formatted Date=date]}}),
AddLevel = Table.AddColumn ( Grouped, "Level", each
// Create variables
let
// max date of each row in the Grouped table
varMaxDate = [Max Date],
// Nested table filtered
FilterGroupedTables = Table.SelectRows ( [All], each varMaxDate = _[Formatted Date] )
in
// Return the value of the Level column in the nested table.
FilterGroupedTables[Level]{0}, Int64.Type
),
RemoveHelperColumns = Table.RemoveColumns(AddLevel,{"Max Date", "All"})
in
RemoveHelperColumns
Thanks for the detailed solution @jennratten . This worked perfectly.
However, I'm trying to avoid using 'Date' as some of the dates are null, and some of the latest 'Level' have an older date (sorting by max date would then not be an accurate indicator).
Ideally, I want to only group by Cust and Level. This will then output all IDs that have the latest 'Level' regardless of taking 'Date' into account.
Example, added Cust "D1"
ITEM | ID | Cust | Date | Level |
1 | A200 | A1 | 16/08/2021 | 2 |
2 | A300 | B1 | 15/09/2021 | 2 |
2 | A300 | B2 | 15/09/2021 | 2 |
3 | A400 | A1 | 18/01/2021 | 1 |
4 | A500 | A1 | 14/07/2021 | 3 |
4 | A500 | B1 | 14/07/2021 | 3 |
4 | A500 | B2 | 14/07/2021 | 3 |
4 | A500 | C1 | 14/07/2021 | 3 |
4 | A600 | D1 | 13/06/2021 | 3 |
Thanks, worked as desired, and without using the 'date' data. 👍
You can try this.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdBJCsAgDADAv3gWjHGpPVafIf7/GzWBQqylenLJkK1WZZVWFwLQQXcbDSSDgPRA1fQ3OR5imSDFHJPMv8HAOWaRBNeEs7ipkCPiRS/JgB2Jp1gQxIss7k3yBsFN0tuN40RTIfwjZd1LWWfhoftSwIu9tBs=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ITEM = _t, ID = _t, Cust = _t, Date = _t, Level = _t]),
ChangedTypeLocale = Table.TransformColumnTypes(Source, {{"Date", type date}}, "hr-HR"),
ChangedType = Table.TransformColumnTypes(ChangedTypeLocale,{{"Level", type number}}),
Grouped = Table.Group(ChangedType, {"ID", "Cust"}, {{"Gr", each Table.FirstN(Table.Sort(_, {{"Date", Order.Descending}, {"Level", Order.Descending}}), 1), type table [ITEM=nullable text, ID=nullable text, Cust=nullable text, Date=nullable date, Level=nullable number]}}),
Expanded = Table.ExpandTableColumn(Grouped, "Gr", {"Date", "Level"})
in
Expanded
Thanks Jakinta 🙂
This gives me the desired output.
However I'm wanting to avoid using 'Date' as one of the sorting fields (because some of the dates are null or older dates for latest level).
Will try to modify this so that only grouping by Cust and Level
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.