I have a bit of a tricky situation and I am not able to figure out how to solve it. I have a huge data set (about a million rows) with many duplicate IDs. The IDs are duplicated because the table holds results over different years; without the duplicates I should have around 100,000 rows. However, I've added calculated columns to the table that sum the values of an ID over all the years combined.
Is there a way to remove the duplicate IDs without changing the results of the calculated columns?
I managed to solve this by loading the same data set again under a different name. I removed the duplicates in the second data set and used the FIRSTNONBLANK function to bring the calculated Tot column from the first data set into the second. A bit of an unconventional approach, I would say, but it does the trick!
This idea was inspired by @az38. Thank you!
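The approach above can be sketched as a calculated column in DAX. The table names 'Data' and 'DataDeduped' are placeholders (the original post does not name the tables), assuming 'Data' is the original table and 'DataDeduped' is the copy loaded under a different name with duplicates removed:

```DAX
-- Hypothetical setup: 'Data' is the original table (duplicate IDs, with
-- the precomputed Tot column); 'DataDeduped' is the deduplicated copy.
-- This calculated column on 'DataDeduped' pulls the first non-blank Tot
-- value for the matching ID from the original table, so Tot is preserved
-- even though the duplicate rows are gone.
Tot =
CALCULATE(
    FIRSTNONBLANK('Data'[Tot], 1),
    FILTER(ALL('Data'), 'Data'[ID] = 'DataDeduped'[ID])
)
```

Since every duplicate row of an ID carries the same Tot value, taking the first non-blank one is safe here.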
I don't get your answer. Perhaps my problem was not very clear. Allow me to elaborate.
Consider the following example with just one ID:
where Tot is calculated as follows: Tot = CALCULATE(SUM('Table'[Val]), ALLEXCEPT('Table', 'Table'[ID])).
I want to remove the duplicate ID rows and keep just one row per ID, without the Tot column changing. If I remove the duplicate rows in the Query Editor, I get Tot = 10, but I want the final result to look like this:
@az38 Because I want to keep the latest period an ID was active, and after removing the duplicates I don't care much about the column containing Val.
Hi @bamba98 ,
Once you remove the duplicate ID rows in the Power Query Editor, that data no longer exists, so you cannot get Tot = 45 afterwards; the real data is gone.
Instead, compute Tot in the Power Query Editor first. You can refer to the following steps:
1. Group the rows by ID, summing Val to get the new Tot.
2. Add a custom column that uses Table.Max to pick the row with the latest Period.
3. Expand the custom column.
4. Finally, delete the Table column.
The complete M Query is as follows:
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDIAUoYGSrE6MBFDSyBliixgAVKCImKOEDFCGIMsgGSKEZIpBsgiMFNiAQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Period = _t, Val = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Period", Int64.Type}, {"Val", Int64.Type}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Table", each _, type table [ID=nullable number, Period=nullable number, Val=nullable number]}, {"New Tot", each List.Sum([Val]), type nullable number}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom.1", each Table.Max([Table], "Period")),
    #"Expanded Custom.1" = Table.ExpandRecordColumn(#"Added Custom", "Custom.1", {"Period"}, {"Custom.1.Period"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Custom.1", {"Table"})
in
    #"Removed Columns"
BTW, the pbix file is attached.
Best regards,
Community Support Team _ zhenbw
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
Try creating a new calculated table like:
Table 2 =
ADDCOLUMNS(
    SUMMARIZE('Table', 'Table'[ID], "Period", MAX('Table'[Period])),
    "Val", CALCULATE(SUM('Table'[Val]), FILTER('Table', 'Table'[ID] = EARLIER([ID]) && 'Table'[Period] = EARLIER([Period]))),
    "Tot", CALCULATE(SUM('Table'[Tot]), FILTER('Table', 'Table'[ID] = EARLIER([ID]) && 'Table'[Period] = EARLIER([Period])))
)