AlexisOlson

Select Distinct Rows Ordered by Another Column -- Power Query Edition

Consider a table like this:

ID   Year   Value   Tag
1    2016   250     C
1    2019   190     B
2    2017   335     D
2    2018   155     A
2    2018   155     A
3    2021   230     B
3    2000   105     D
4    2019   160     C

 

Here's the M query in case you'd like to follow along in your own query editor:

 

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDQDUaYGQNJZKVYHLmwJpAwtQcJOYGEjiLA5kDI2NgWSLsjCFiDVpiBhR3zCxmBhI7AVxgizIcIGIAFDA4TZJkguMYM6MBYA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Year = _t, Value = _t, Tag = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Year", Int64.Type}, {"Value", type number}, {"Tag", type text}})
in
    #"Changed Type"

 

 

The goal is to select only the rows that correspond to the most recent year, one for each distinct ID.

ID   Year   Value   Tag
1    2019   190     B
2    2018   155     A
3    2021   230     B
4    2019   160     C

 

Approach #1: Create a new column then filter

If you're in a DAX mindset, it might occur to you to create a calculated column [IsMaxYear] that checks, for each row, whether the current row matches the max year for that row's ID, and then filter for [IsMaxYear] = TRUE.

 

We can indeed do this by adding a custom column step and a filter step like this:

 

 

#"Added Custom" = Table.AddColumn(
    #"Changed Type",
    "IsMaxYear",
    (r) => List.Max(
               Table.SelectRows(#"Changed Type", each [ID] = r[ID])[Year]
           ) = r[Year],
    type logical
),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([IsMaxYear] = true))

 

 

Note: I'm choosing to omit a detailed explanation of how this code works since it would take us off-topic and it isn't a good solution anyway.

 

There are a couple of major drawbacks to this approach. First, it requires more than a surface-level knowledge of M to fully understand. Second, it's slow and inefficient for large data sets since, for every single row, it performs operations on the entire table from the previous step.

 

Approach #2: Sort and remove duplicates

If you were following along in your own query editor for the prior approach, you might have noticed that it did not eliminate duplicates where ties were involved (i.e. ID = 2 still had two identical rows even after filtering). So why not start by removing duplicates, since we might need to do that anyway?

 

Using the GUI Home tab > Remove Rows > Remove Duplicates on the ID column allows us to select only the first appearance of each ID. Since the first one isn't necessarily the one we want, we need to sort by ID and Year first to get this to work.

 

We can attempt this as follows:

 

 

#"Sorted Rows" = Table.Sort(#"Changed Type",{{"ID", Order.Ascending}, {"Year", Order.Descending}}),
#"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"ID"})

 

 

However, this does not always work. For technical reasons (see here and here for a bit more discussion), Table.Distinct isn't guaranteed to operate on the sorted order unless we force the sort to stick. Luckily, we can use Table.Buffer to load the sorted table into memory so that the duplicate removal keeps the rows we want.

 

We can either add Table.Buffer as a new intermediate step, or wrap it around the definition of, or the reference to, #"Sorted Rows". I'm arbitrarily picking the last of these options here (the other two variants are sketched after the snippet below):

 

 

#"Sorted Rows" = Table.Sort(#"Changed Type",{{"ID", Order.Ascending}, {"Year", Order.Descending}}),
#"Removed Duplicates" = Table.Distinct(Table.Buffer(#"Sorted Rows"), {"ID"})

 

 

This is a decent method. However, the drawback is that it requires Table.Buffer, which means we have to load the whole table into memory, and it breaks any query folding that would otherwise occur if we were querying another source like a SQL Server database.

 

Approach #3: Custom Group By

The basis for this approach starts with a step generated by Home > Group By.

[Screenshot: Group By dialog — group by ID, with a Max of Year aggregation and an All Rows aggregation named Tbl]

Note: I discovered this method when someone linked me to this article.

 

The corresponding M code looks like this:

 

 

= Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type nullable number}, {"Tbl", each _, type table}})

 

 

Each of the tables in the Tbl column is the subtable corresponding to the ID in that row. For example, the preview for the second row looks like this:

[Screenshot: preview of the subtable in the Tbl column for ID = 2]
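As an aside, if you want to inspect one of these subtables directly, you can reference it by row position in the formula bar. For instance, something like this (assuming the grouped step is named #"Grouped Rows", as in the later steps) returns the subtable for the second row:

= #"Grouped Rows"{1}[Tbl]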

This is a powerful approach that leads to a variety of possibilities for working with the Tbl column.

 

We can modify the last part of the GUI-generated code so that the Tbl column only keeps the top row of each subtable ordered by Year, in which case the result is a Record rather than a table:

 

 

{"Tbl", each Table.Max(_, "Year"), type record}

 

 

[Screenshot: grouped table with the Tbl column now containing a Record for each ID]

 

From here, we can click the expand button in the Tbl column header and choose the fields we want.

 

 

= Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", {"Value", "Tag"}, {"Value", "Tag"})
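For reference, assembling the steps above into a full query for this approach looks something like the following sketch (the step names are my own):

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDQDUaYGQNJZKVYHLmwJpAwtQcJOYGEjiLA5kDI2NgWSLsjCFiDVpiBhR3zCxmBhI7AVxgizIcIGIAFDA4TZJkguMYM6MBYA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Year = _t, Value = _t, Tag = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Year", Int64.Type}, {"Value", type number}, {"Tag", type text}}),
    // Group by ID, keeping the max Year plus the top row of each subtable (ordered by Year) as a record
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type nullable number}, {"Tbl", each Table.Max(_, "Year"), type record}}),
    // Expand the remaining fields from the record column
    #"Expanded Tbl" = Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", {"Value", "Tag"}, {"Value", "Tag"})
in
    #"Expanded Tbl"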

 

 

 

In the special case where you have only a single column to expand, you can combine these two steps:

 

 

#"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type number}, {"Tbl", each Table.Max(_, "Year"), type record}}),
#"Expanded Tbl" = Table.ExpandTableColumn(#"Grouped Rows", "Tbl", {"Value"}, {"Value"})

 

 

into a single step by specifying which column to return within the grouping step (appending the column name in square brackets after Table.Max):

 

 

Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type number}, {"Value", each Table.Max(_, "Year")[Value], type number}})

 

 

This method is also decent but, like the prior method, it breaks query folding, which can be a deal-breaker if you're loading large tables.

 

Approach #4: Group by and merge with an earlier step

If you work with SQL frequently, this might be what you immediately think of. As with the previous approach, we start with a standard Group By on ID, taking the maximum over Year (and nothing else).

[Screenshot: Group By dialog — group by ID, aggregating Max of Year only]
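The M code behind this grouping step should look something like this:

= Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type nullable number}})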

This time, instead of tinkering with a column of tables, we'll do a self-merge. Click Home > Merge Queries > Merge Queries, hold down Ctrl or Shift to select both columns, and choose to merge with the current table (both columns again).

 

This will generate a step with M code like this:

 

 

= Table.NestedJoin(#"Grouped Rows", {"ID", "Year"}, #"Grouped Rows", {"ID", "Year"}, "Grouped Rows", JoinKind.LeftOuter)

 

 

What we actually want though is to merge with the step from before the grouping, so we modify the second table reference to

 

 

= Table.NestedJoin(#"Grouped Rows", {"ID", "Year"}, #"Changed Type", {"ID", "Year"}, "Grouped Rows", JoinKind.LeftOuter)

 

 

Note: This trick of self-merging (self-joining) and then modifying the step reference is handy for all sorts of queries where you want the equivalent of a common table expression without defining a separate query.

 

This is similar to the situation in the previous approach except that we haven't broken query folding (yet) and duplicate rows have not (yet) been resolved:

[Screenshot: preview of the merged table with the nested Grouped Rows column, duplicates still present for ID = 2]

If (unlike the above) there are no ties (duplicates) anywhere, then all that remains is to expand the columns we want.

 

 

= Table.ExpandTableColumn(#"Merged Queries", "Grouped Rows", {"Value", "Tag"}, {"Value", "Tag"})

 

 

If there are duplicates, we can use Remove Rows > Remove Duplicates on the first two columns (ID and Year) as a final step. This does break query folding, unlike Remove Duplicates performed on all of the columns. That might not be a problem since we're working with a smaller result than we started with.
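In M, that final Remove Duplicates step would be something like this (assuming the expand step is named #"Expanded Grouped Rows", as in the full code further below):

= Table.Distinct(#"Expanded Grouped Rows", {"ID", "Year"})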

 

We can preserve query folding using Group By again, grouping on ID and Year, and aggregating the remaining columns as we wish (I used Max).

 

 

= Table.Group(#"Expanded Grouped Rows", {"ID", "Year"}, {{"Value", each List.Max([Value]), type number}, {"Tag", each List.Max([Tag]), type text}})

 

 

 

This last approach, while not the simplest, works better than the previous ones when loading a large table from SQL Server since it can be folded into a single SQL statement that is evaluated on the server rather than within the query editor.

 

Right-clicking on the last step and choosing View Native Query, you'll see how it's interpreted in SQL (assuming, of course, you connected to a SQL source rather than using my example data):

 

 

select [rows].[ID2] as [ID],
    [rows].[Year] as [Year],
    max([rows].[Amount]) as [Amount],
    max([rows].[Tag]) as [Tag]
from 
(
    select [$Outer].[ID2],
        [$Outer].[Year],
        [$Inner].[Amount],
        [$Inner].[Tag]
    from 
    (
        select [rows].[ID] as [ID2],
            max([rows].[Year]) as [Year]
        from [dbo].[TableName] as [rows]
        group by [ID]
    ) as [$Outer]
    left outer join [dbo].[TableName] as [$Inner] on (([$Outer].[ID2] = [$Inner].[ID] or [$Outer].[ID2] is null and [$Inner].[ID] is null) and [$Outer].[Year] = [$Inner].[Year])
) as [rows]
group by [ID2], [Year]

 

 

 

Here's the full M code for this last version:

 

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDQDUaYGQNJZKVYHLmwJpAwtQcJOYGEjiLA5kDI2NgWSLsjCFiDVpiBhR3zCxmBhI7AVxgizIcIGIAFDA4TZJkguMYM6MBYA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Year = _t, Value = _t, Tag = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Year", Int64.Type}, {"Value", type number}, {"Tag", type text}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type nullable number}}),
    #"Merged Queries" = Table.NestedJoin(#"Grouped Rows", {"ID", "Year"}, #"Changed Type", {"ID", "Year"}, "Grouped Rows", JoinKind.LeftOuter),
    #"Expanded Grouped Rows" = Table.ExpandTableColumn(#"Merged Queries", "Grouped Rows", {"Value", "Tag"}, {"Value", "Tag"}),
    #"Grouped Rows1" = Table.Group(#"Expanded Grouped Rows", {"ID", "Year"}, {{"Value", each List.Max([Value]), type nullable number}, {"Tag", each List.Max([Tag]), type nullable text}})
in
    #"Grouped Rows1"

 

 

 

Please let me know of other sources and methods you've seen or written to solve this problem. I may expand this post if I come across other interesting methods.

 

References:

https://stackoverflow.com/questions/51437962

https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/

https://www.ehansalytics.com/blog/2020/7/16/return-row-based-on-max-value-from-one-column-when-group...

 

Related Posts:

Selectively remove duplicate rows?

Remove duplicates - keeping the most recent row

Removing duplicates, keeping rows where a field holds a certain value

Remove duplicates by prioritizing rows based on another column

How to return the most recent value of a second column based on a first column

Power Query Editor - Remove Duplicates (selective column)

Comments

Addendum: Approach #3 can easily be modified to return the top N rows, not just the top 1, using Table.MaxN

 

See here: 

https://community.powerbi.com/t5/Desktop/Filter-out-top-3-rows-based-on-same-timestamp/m-p/2159050
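For example, a sketch of that modification (my own step names; keeping the top 2 years per ID and then expanding the table column) might look like:

#"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Tbl", each Table.MaxN(_, "Year", 2), type table}}),
#"Expanded Tbl" = Table.ExpandTableColumn(#"Grouped Rows", "Tbl", {"Year", "Value", "Tag"}, {"Year", "Value", "Tag"})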

Thank you for this great blog post! I instinctively tried the sorting method (#2) first and was wondering why it didn't work as intended, but the buffer fix worked like a charm.

Thank you for sharing such a great article!

This is a very common problem, and your exposition of the different alternatives and recommendations is very educational.

This should be mandatory reading for anyone wanting to learn Power Query.