Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
bamba98
Helper I
Helper I

Remove duplicates without changing values in calculated column

I have a bit of a tricky situation and I am not able to figure out how to solve it. So I have a huge data set (million rows) and many duplicate IDs. The duplicate IDs are within the table because it shows results over different years. Without the duplicates I should have around 100 000 rows. However, I've added some calculated columns to the table that sums values of an ID over all the years combined. 

 

Is there a way to remove the duplicate IDs without changing the result of the calculated columns?

1 ACCEPTED SOLUTION

I managed to solve this by uploading again the same data set under a different name. I've removed the duplicates in the second data set and used FIRSTNONBLANK function to add the calculated Tot column of the first data set to the second data set. A bit of an unconventional approach I would say, but it does the trick!

 

This idea was inspired by @az38. Thank you!

View solution in original post

7 REPLIES 7
az38
Community Champion
Community Champion

@bamba98 

if you created new columns with DAX expressions - no, dataset won't be removed correct in any case


do not hesitate to give a kudo to useful posts and mark solutions as solution
LinkedIn

I don't get your answer. Perhaps my problem was not very clear. Allow me to elaborate.

 

Consider the following example with just one ID:

bamba98_0-1594581393648.png

where Tot is calculated as follows: Tot = calculate(sum(val), allexcept(ID)).

 

I want to remove the duplicate IDs and just keep one ID without the Tot column to change. Now, if I remove the duplicate ID rows with edit query, I get that Tot=10, but I want the final result to look like this: 

bamba98_1-1594581589472.png

 

az38
Community Champion
Community Champion

@bamba98 

what's rule for duplicates definition? why do you remove exactly 2019, 2018 and 2017 rows and not 2020?


do not hesitate to give a kudo to useful posts and mark solutions as solution
LinkedIn

@az38 Because I want to keep the latest period an ID was active and after removing the duplicates I don't care much about the column containing Val.

Hi @bamba98 ,

 

In Power Query Editor, you remove the duplicate ID rows, so they are not existed. You cannot get the Tot=45, because the real data is gone.

You need to get the Tot in Power Query Editor, and you can refer the following steps.

 

1. Group by ID.

 

R1.jpg

 

2. Add a custom column to get the max period.

 

R2.jpg

 

3. Expand the custom column.

 

R3.jpg

 

4. At last delete the Table column.

 

R4.jpg

 

R5.jpg

 

The complete M Query is as follows:

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDIAUoYGSrE6MBFDSyBliixgAVKCImKOEDFCGIMsgGSKEZIpBsgiMFNiAQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Period = _t, Val = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Period", Int64.Type}, {"Val", Int64.Type}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Table", each _, type table [ID=nullable number, Period=nullable number, Val=nullable number]}, {"New Tot", each List.Sum([Val]), type nullable number}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom.1", each Table.Max([Table],"Period")),
    #"Expanded Custom.1" = Table.ExpandRecordColumn(#"Added Custom", "Custom.1", {"Period"}, {"Custom.1.Period"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Custom.1",{"Table"})
in
    #"Removed Columns"

 

BTW, pbix as attached.

 

Best regards,

 

Community Support Team _ zhenbw

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

I managed to solve this by uploading again the same data set under a different name. I've removed the duplicates in the second data set and used FIRSTNONBLANK function to add the calculated Tot column of the first data set to the second data set. A bit of an unconventional approach I would say, but it does the trick!

 

This idea was inspired by @az38. Thank you!

az38
Community Champion
Community Champion

@bamba98 

try to create new calculated table like

Table 2 = 
ADDCOLUMNS(
SUMMARIZE('Table', 'Table'[ID], "Period", MAX('Table'[Period])),
"Val", CALCULATE(SUM('Table'[Val]), FILTER('Table', 'Table'[ID] = EARLIER([ID]) && 'Table'[Period] = EARLIER([Period]))),
"Tot", CALCULATE(SUM('Table'[Tot]), FILTER('Table', 'Table'[ID] = EARLIER([ID]) && 'Table'[Period] = EARLIER([Period])))

)

do not hesitate to give a kudo to useful posts and mark solutions as solution
LinkedIn

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.