Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Anonymous
Not applicable

Power Query append ,removing duplicate rows

To whom it may concern:

 

I am using power query to combine several different files, all the columns are the same, so it's straight forward in that context. Each row in every file is unique on an ID number, however, file over file there may be duplicates as the row may have been updated between dumps. What I would like is to be able to remove the row from the older file.

 

The data looks something like this:

File 1 - January 1

IDApplication Received DateApplication Assessed DateFunding Amount
1December 20, 2020December 21, 2020$400
2December 22, 2020December 22, 2020$500
3December 22, 2020  
4October 2, 2020October 4, 2020$1000

 

File 2 - January 30

IDApplication Received DateApplication Assessed Date Funding Amount
3December 22, 2020December 24, 2020$300
4November 1, 2020November 4, 2020$300
5November 3, 2020NOvember 5, 2020$500
6November 9, 2020November 11, 2020$600

 

All columns can change information except the ID, so I would not be able to group on anything. In the end, I would like to keep the ID files that are common to be from the more recent file, i.e. in the example above, I would like to keep ID 1 and 2 from file 1 and thne 3, 4, 5, and 6 from the second one.

 

I think the first step is to add in the date of the file as a column, and then group on the ID and date the file was created, but I don't know what to do from there.

 

Any help would be appreciated.

 

Thanks in advance,

 

Van

1 ACCEPTED SOLUTION
Jimmy801
Community Champion
Community Champion

Hello @Anonymous 

 

what you can do is to take the older file, join it with the newer file using a LeftAnti-join (IDs from the newer file will be removed) and then combining with the newer file. If you have a lot of files this could be some challengen. Another option could be to integrate the creation date of the file or another information that gives you somehow a information of time and then group by id, then taking only the date with max value. Here an example on what I mean

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJAYgC8gz1DUBIKVYnWskIRbURFhMQYiqm2EwwxqFaAYrRlJsAhfyTS/LBquGKYSImCMsMDUixDSGGZIQxignGCAf45ZdBVCOCBy5ESL8psmJjhH5/qJAprvCC6jdD1m+Jab8hUpSZYRgQCwA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t, #"Creation date" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Application Received Date", type date}, {"Application Assessed Date", type date}, {"Funding Amount", type text}, {"Creation date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"MaxCreationDate", each Table.Max(_,"Creation date")}}),
    #"Expanded MaxCreationDate" = Table.ExpandRecordColumn(#"Grouped Rows", "MaxCreationDate", {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"}, {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"})
in
    #"Expanded MaxCreationDate"

transforms this

Jimmy801_0-1613639105800.png

into this

Jimmy801_1-1613639155130.png

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too

Have fun

Jimmy

 

View solution in original post

4 REPLIES 4
Sergii24
Super User
Super User

Hi @Anonymous ,


What you're looking for is perfectly explained in this article https://radacad.com/grouping-in-power-query-getting-the-last-item-in-each-group. There is no much to add 🙂


Good luck!

P.S. I've used it by myself just few days ago and it worked exactly as expected! Radacad articles are always easy to follow.

Jimmy801
Community Champion
Community Champion

Hello @Anonymous 

 

what you can do is to take the older file, join it with the newer file using a LeftAnti-join (IDs from the newer file will be removed) and then combining with the newer file. If you have a lot of files this could be some challengen. Another option could be to integrate the creation date of the file or another information that gives you somehow a information of time and then group by id, then taking only the date with max value. Here an example on what I mean

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJAYgC8gz1DUBIKVYnWskIRbURFhMQYiqm2EwwxqFaAYrRlJsAhfyTS/LBquGKYSImCMsMDUixDSGGZIQxignGCAf45ZdBVCOCBy5ESL8psmJjhH5/qJAprvCC6jdD1m+Jab8hUpSZYRgQCwA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t, #"Creation date" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Application Received Date", type date}, {"Application Assessed Date", type date}, {"Funding Amount", type text}, {"Creation date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"MaxCreationDate", each Table.Max(_,"Creation date")}}),
    #"Expanded MaxCreationDate" = Table.ExpandRecordColumn(#"Grouped Rows", "MaxCreationDate", {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"}, {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"})
in
    #"Expanded MaxCreationDate"

transforms this

Jimmy801_0-1613639105800.png

into this

Jimmy801_1-1613639155130.png

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too

Have fun

Jimmy

 

Are the IDs incrementing over time, always?  If so, you could do a group by all other columns (the ones the same) and then do a max of the ID column. 



* Matt is an 8 times Microsoft MVP (Power BI) and author of the Power BI Book Supercharge Power BI.
CNENFRNL
Community Champion
Community Champion

@Anonymous , of coz grouping all combined records by ID is a most straightforward and efficient solution to your issue.

 

First, combine all files in a CHRONOLOGICAL order (important!);

Secondly, index the combined records;

Thirdly, group the records by ID and keep the record with the max index.

let
    File1 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJgYFSrE60khGKAiMsmhBiKqZQTcY4FCiAMUiFCZDln1ySD1YAl4eJmCCMNDQAmRkLAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t]),
    File2 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlbSUXJJTU7NTUotUjAy0lEwMjAyQBEzgYupGBsYKMXqRCuZADl++WUQBYZwebgQFi2myPLGCC3+UCFThBZTqBYzZC2WmLYYImxWMQPpiQUA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t]),
    #"Combined Files" = File1 & File2,
    Columns = Table.ColumnNames(#"Combined Files"),
    #"Added Index" = Table.AddIndexColumn(#"Combined Files", "Index", 1, 1, Int64.Type),
    #"Grouped Rows" = Table.Group(#"Added Index", {"ID"}, {{"ar", each _, type table [ID=nullable text, Application Received Date=nullable text, Application Assessed Date=nullable text, Funding Amount=nullable text, Index=number]}}),
    #"Removed Columns" = Table.TransformColumns(Table.RemoveColumns(#"Grouped Rows",{"ID"}), {{"ar", each Table.Last(Table.Sort(_, {"Index", Order.Ascending}))}}),
    #"Expanded ar" = Table.ExpandRecordColumn(#"Removed Columns", "ar", Columns, Columns)
in
    #"Expanded ar"

Screenshot 2021-02-18 095603.png


Thanks to the great efforts by MS engineers to simplify syntax of DAX! Most beginners are SUCCESSFULLY MISLED to think that they could easily master DAX; but it turns out that the intricacy of the most frequently used RANKX() is still way beyond their comprehension!

DAX is simple, but NOT EASY!

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors