Solved: Re: Power Query append ,removing duplicate rows

Anonymous · ‎02-17-2021

To whom it may concern:

I am using power query to combine several different files, all the columns are the same, so it's straight forward in that context. Each row in every file is unique on an ID number, however, file over file there may be duplicates as the row may have been updated between dumps. What I would like is to be able to remove the row from the older file.

The data looks something like this:

File 1 - January 1

ID	Application Received Date	Application Assessed Date	Funding Amount
1	December 20, 2020	December 21, 2020	$400
2	December 22, 2020	December 22, 2020	$500
3	December 22, 2020
4	October 2, 2020	October 4, 2020	$1000

File 2 - January 30

ID	Application Received Date	Application Assessed Date	Funding Amount
3	December 22, 2020	December 24, 2020	$300
4	November 1, 2020	November 4, 2020	$300
5	November 3, 2020	NOvember 5, 2020	$500
6	November 9, 2020	November 11, 2020	$600

All columns can change information except the ID, so I would not be able to group on anything. In the end, I would like to keep the ID files that are common to be from the more recent file, i.e. in the example above, I would like to keep ID 1 and 2 from file 1 and thne 3, 4, 5, and 6 from the second one.

I think the first step is to add in the date of the file as a column, and then group on the ID and date the file was created, but I don't know what to do from there.

Any help would be appreciated.

Thanks in advance,

Van

Jimmy801 · ‎02-18-2021

Hello @Anonymous

what you can do is to take the older file, join it with the newer file using a LeftAnti-join (IDs from the newer file will be removed) and then combining with the newer file. If you have a lot of files this could be some challengen. Another option could be to integrate the creation date of the file or another information that gives you somehow a information of time and then group by id, then taking only the date with max value. Here an example on what I mean

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJAYgC8gz1DUBIKVYnWskIRbURFhMQYiqm2EwwxqFaAYrRlJsAhfyTS/LBquGKYSImCMsMDUixDSGGZIQxignGCAf45ZdBVCOCBy5ESL8psmJjhH5/qJAprvCC6jdD1m+Jab8hUpSZYRgQCwA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t, #"Creation date" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Application Received Date", type date}, {"Application Assessed Date", type date}, {"Funding Amount", type text}, {"Creation date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"MaxCreationDate", each Table.Max(_,"Creation date")}}),
    #"Expanded MaxCreationDate" = Table.ExpandRecordColumn(#"Grouped Rows", "MaxCreationDate", {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"}, {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"})
in
    #"Expanded MaxCreationDate"

transforms this

into this

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too

Have fun

Jimmy

View solution in original post

Sergii24 · ‎02-18-2021

Hi @Anonymous ,

What you're looking for is perfectly explained in this article https://radacad.com/grouping-in-power-query-getting-the-last-item-in-each-group. There is no much to add 🙂

Good luck!

P.S. I've used it by myself just few days ago and it worked exactly as expected! Radacad articles are always easy to follow.

Jimmy801 · ‎02-18-2021

Hello @Anonymous

what you can do is to take the older file, join it with the newer file using a LeftAnti-join (IDs from the newer file will be removed) and then combining with the newer file. If you have a lot of files this could be some challengen. Another option could be to integrate the creation date of the file or another information that gives you somehow a information of time and then group by id, then taking only the date with max value. Here an example on what I mean

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJAYgC8gz1DUBIKVYnWskIRbURFhMQYiqm2EwwxqFaAYrRlJsAhfyTS/LBquGKYSImCMsMDUixDSGGZIQxignGCAf45ZdBVCOCBy5ESL8psmJjhH5/qJAprvCC6jdD1m+Jab8hUpSZYRgQCwA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t, #"Creation date" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Application Received Date", type date}, {"Application Assessed Date", type date}, {"Funding Amount", type text}, {"Creation date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"MaxCreationDate", each Table.Max(_,"Creation date")}}),
    #"Expanded MaxCreationDate" = Table.ExpandRecordColumn(#"Grouped Rows", "MaxCreationDate", {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"}, {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"})
in
    #"Expanded MaxCreationDate"

transforms this

into this

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too

Have fun

Jimmy

MattAllington · ‎02-18-2021

Are the IDs incrementing over time, always? If so, you could do a group by all other columns (the ones the same) and then do a max of the ID column.

* Matt is an 8 times Microsoft MVP (Power BI) and author of the Power BI Book Supercharge Power BI.

CNENFRNL · ‎02-18-2021

@Anonymous , of coz grouping all combined records by ID is a most straightforward and efficient solution to your issue.

First, combine all files in a CHRONOLOGICAL order (important!);

Secondly, index the combined records;

Thirdly, group the records by ID and keep the record with the max index.

let
    File1 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJgYFSrE60khGKAiMsmhBiKqZQTcY4FCiAMUiFCZDln1ySD1YAl4eJmCCMNDQAmRkLAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t]),
    File2 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlbSUXJJTU7NTUotUjAy0lEwMjAyQBEzgYupGBsYKMXqRCuZADl++WUQBYZwebgQFi2myPLGCC3+UCFThBZTqBYzZC2WmLYYImxWMQPpiQUA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t]),
    #"Combined Files" = File1 & File2,
    Columns = Table.ColumnNames(#"Combined Files"),
    #"Added Index" = Table.AddIndexColumn(#"Combined Files", "Index", 1, 1, Int64.Type),
    #"Grouped Rows" = Table.Group(#"Added Index", {"ID"}, {{"ar", each _, type table [ID=nullable text, Application Received Date=nullable text, Application Assessed Date=nullable text, Funding Amount=nullable text, Index=number]}}),
    #"Removed Columns" = Table.TransformColumns(Table.RemoveColumns(#"Grouped Rows",{"ID"}), {{"ar", each Table.Last(Table.Sort(_, {"Index", Order.Ascending}))}}),
    #"Expanded ar" = Table.ExpandRecordColumn(#"Removed Columns", "ar", Columns, Columns)
in
    #"Expanded ar"

Thanks to the great efforts by MS engineers to simplify syntax of DAX! Most beginners are SUCCESSFULLY MISLED to think that they could easily master DAX; but it turns out that the intricacy of the most frequently used RANKX() is still way beyond their comprehension!

DAX is simple, but NOT EASY!

Power Query append ,removing duplicate rows

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024