Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Hi. I have been bugging witht this problem to count duplicates in multiple columns involving 12millions rows and counting. I am trying now with the approach which does not provide me the correct result.
Basically what I need is to check how maany from in the table have 5 similar entries with the current table
Table has 6 columns, and duplicate number does not need to be in the same column. As you can see my attemp below is to check only 1 duplicate (I checked only the first column agains 2 columns just to see if it works)
Any different approach involving measure is appreciated (calculated columns may either be slow or will result to memory issue).
Thanks.
=sumx( '3_13May', if(sumx('3_13May',SWITCH ( TRUE (), '3_13May'[Column1]=EARLIER('3_13May'[Column1]),1, '3_13May'[Column2]=EARLIER('3_13May'[Column1]),1,0))=1,1,0 ))
Hi @theo
You can achieve this by applying three steps in Query editor.
Please see the M code below based on the example that you provided.
let Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("dY+xDcQwDAN3cZ0iEpnEmSXw/mu8zm4ej3xBwQB5lPw8LdrWsqSSS0fpbGN7d67p9HrdpdgZ5IJg6J8LHsdqpSZOxsUgHcRzf/NzXkF7rvbknKQwSSbJnMn+69MqjhAlyukLX/iCF7zg1b98UIMa1Oz32m9QgxrUoOYTvtsYHw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t, Column4 = _t, Column5 = _t, Column6 = _t]), #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", Int64.Type}, {"Column2", Int64.Type}, {"Column3", Int64.Type}, {"Column4", Int64.Type}, {"Column5", Int64.Type}, {"Column6", Int64.Type}}), #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Changed Type", {}, "Attribute", "Value"), #"Grouped Rows" = Table.Group(#"Unpivoted Columns", {"Value"}, {{"Count", each Table.RowCount(_), type number}}), #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Count] <> 1)) in #"Filtered Rows"
hi @Mariusz Thanks but it doesnt provide the result that I need.
The code you provided just count the duplicate of all entry .
What I need is to count the rows that have duplicate (eg. 5 duplicates) based on the current row.
in my example under dup 5 count, line 1 has a count of 1 since there is one row that has 5 similar entry with line 1 by virtue of 1,2,3,4, and 5. same goes with line 2 while the rest has 0 meaning no row in the table that has 5 similar entry.
Hi @theo
Please see the below three tables, I believe that this will do whats required however I do not think it will perform very well with 12 million rows.
// yourSampleTable let Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("dY+xDcQwDAN3cZ0iEpnEmSXw/mu8zm4ej3xBwQB5lPw8LdrWsqSSS0fpbGN7d67p9HrdpdgZ5IJg6J8LHsdqpSZOxsUgHcRzf/NzXkF7rvbknKQwSSbJnMn+69MqjhAlyukLX/iCF7zg1b98UIMa1Oz32m9QgxrUoOYTvtsYHw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t, Column4 = _t, Column5 = _t, Column6 = _t]), #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1), #"Reordered Columns" = Table.ReorderColumns(#"Added Index",{"Index", "Column1", "Column2", "Column3", "Column4", "Column5", "Column6"}), #"Changed Type" = Table.TransformColumnTypes(#"Reordered Columns",{{"Index", Int64.Type}}) in #"Changed Type"
// duplicateCountTable let Source = yourSampleTable, #"Unpivoted Columns" = Table.UnpivotOtherColumns(Source, {"Index"}, "Column", "Value"), #"Merged Queries" = Table.NestedJoin(#"Unpivoted Columns", {"Value"}, #"Unpivoted Columns", {"Value"}, "Unpivoted Columns", JoinKind.Inner), #"Expanded Unpivoted Columns" = Table.ExpandTableColumn(#"Merged Queries", "Unpivoted Columns", {"Index"}, {"Index.1"}), #"Filtered Rows" = Table.SelectRows(#"Expanded Unpivoted Columns", each ([Index.1] <> [Index])), #"Grouped Rows" = Table.Group(#"Filtered Rows", {"Index", "Index.1"}, {{"Count", each "Dup " & Number.ToText( List.Count(_) ) & " Count", type text }}), #"Sorted Rows" = Table.Sort(#"Grouped Rows",{{"Count", Order.Descending}}), #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Sorted Rows", {{"Count", type text}}, "en-GB"), List.Distinct(Table.TransformColumnTypes(#"Sorted Rows", {{"Count", type text}}, "en-GB")[Count]), "Count", "Index.1", List.Count) in #"Pivoted Column"
// mergeOutput let Source = Table.NestedJoin(yourSampleTable, {"Index"}, duplicateCountTable, {"Index"}, "duplicateCountTable", JoinKind.LeftOuter), #"Expanded duplicateCountTable" = Table.ExpandTableColumn(Source, "duplicateCountTable", {"Dup 5 Count", "Dup 4 Count", "Dup 3 Count", "Dup 2 Count", "Dup 1 Count"}, {"Dup 5 Count", "Dup 4 Count", "Dup 3 Count", "Dup 2 Count", "Dup 1 Count"}) in #"Expanded duplicateCountTable"
Hi @Mariusz Thanks. Will try this one. That is also my concern for using power query, it may not be able to work on the 12m rows. So Im thibking if measure maybe a better alternative.
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.
User | Count |
---|---|
113 | |
97 | |
79 | |
74 | |
56 |
User | Count |
---|---|
145 | |
105 | |
104 | |
90 | |
63 |