cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
bamba98
Helper I
Helper I

Remove duplicates without changing values in calculated column

I have a bit of a tricky situation and I am not able to figure out how to solve it. So I have a huge data set (million rows) and many duplicate IDs. The duplicate IDs are within the table because it shows results over different years. Without the duplicates I should have around 100 000 rows. However, I've added some calculated columns to the table that sums values of an ID over all the years combined. 

 

Is there a way to remove the duplicate IDs without changing the result of the calculated columns?

1 ACCEPTED SOLUTION

I managed to solve this by uploading again the same data set under a different name. I've removed the duplicates in the second data set and used FIRSTNONBLANK function to add the calculated Tot column of the first data set to the second data set. A bit of an unconventional approach I would say, but it does the trick!

 

This idea was inspired by @az38. Thank you!

View solution in original post

7 REPLIES 7
az38
Super User
Super User

@bamba98 

if you created new columns with DAX expressions - no, dataset won't be removed correct in any case


do not hesitate to give a kudo to useful posts and mark solutions as solution
LinkedIn

I don't get your answer. Perhaps my problem was not very clear. Allow me to elaborate.

 

Consider the following example with just one ID:

bamba98_0-1594581393648.png

where Tot is calculated as follows: Tot = calculate(sum(val), allexcept(ID)).

 

I want to remove the duplicate IDs and just keep one ID without the Tot column to change. Now, if I remove the duplicate ID rows with edit query, I get that Tot=10, but I want the final result to look like this: 

bamba98_1-1594581589472.png

 

@bamba98 

what's rule for duplicates definition? why do you remove exactly 2019, 2018 and 2017 rows and not 2020?


do not hesitate to give a kudo to useful posts and mark solutions as solution
LinkedIn

@az38 Because I want to keep the latest period an ID was active and after removing the duplicates I don't care much about the column containing Val.

Hi @bamba98 ,

 

In Power Query Editor, you remove the duplicate ID rows, so they are not existed. You cannot get the Tot=45, because the real data is gone.

You need to get the Tot in Power Query Editor, and you can refer the following steps.

 

1. Group by ID.

 

R1.jpg

 

2. Add a custom column to get the max period.

 

R2.jpg

 

3. Expand the custom column.

 

R3.jpg

 

4. At last delete the Table column.

 

R4.jpg

 

R5.jpg

 

The complete M Query is as follows:

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDIAUoYGSrE6MBFDSyBliixgAVKCImKOEDFCGIMsgGSKEZIpBsgiMFNiAQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Period = _t, Val = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Period", Int64.Type}, {"Val", Int64.Type}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Table", each _, type table [ID=nullable number, Period=nullable number, Val=nullable number]}, {"New Tot", each List.Sum([Val]), type nullable number}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom.1", each Table.Max([Table],"Period")),
    #"Expanded Custom.1" = Table.ExpandRecordColumn(#"Added Custom", "Custom.1", {"Period"}, {"Custom.1.Period"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Custom.1",{"Table"})
in
    #"Removed Columns"

 

BTW, pbix as attached.

 

Best regards,

 

Community Support Team _ zhenbw

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

I managed to solve this by uploading again the same data set under a different name. I've removed the duplicates in the second data set and used FIRSTNONBLANK function to add the calculated Tot column of the first data set to the second data set. A bit of an unconventional approach I would say, but it does the trick!

 

This idea was inspired by @az38. Thank you!

View solution in original post

@bamba98 

try to create new calculated table like

Table 2 = 
ADDCOLUMNS(
SUMMARIZE('Table', 'Table'[ID], "Period", MAX('Table'[Period])),
"Val", CALCULATE(SUM('Table'[Val]), FILTER('Table', 'Table'[ID] = EARLIER([ID]) && 'Table'[Period] = EARLIER([Period]))),
"Tot", CALCULATE(SUM('Table'[Tot]), FILTER('Table', 'Table'[ID] = EARLIER([ID]) && 'Table'[Period] = EARLIER([Period])))

)

do not hesitate to give a kudo to useful posts and mark solutions as solution
LinkedIn

Helpful resources

Announcements
UG GA Amplification 768x460.png

Launching new user group features

Learn how to create your own user groups today!

Power BI October Update 2021.jpg

Power BI Release

Click here to read more about the October 2021 Release!

Community Connections 768x460.jpg

Community & How To Videos

Check out the new Power Platform Community Connections gallery!

Teds Dev Camp Oct. 2021 768x460.jpg

Power BI Dev Camp - October 28th, 2021

Mark your calendars and join us for our next Power BI Dev Camp!