Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
theo
Helper III
Helper III

Count duplicate values using switch or if statement

Hi.  I have been bugging witht this problem to count duplicates in multiple columns involving 12millions rows and counting.  I am trying now with the approach which does not provide me the correct result.

Basically what I need is to check how maany from in the table have 5 similar entries with the current table
Table has 6 columns, and duplicate number does not need to be in the same column.  As you can see my attemp below is to check only 1 duplicate (I checked only the first column agains 2 columns just to see if it works)

Any different approach involving measure is appreciated (calculated columns may either be slow or will result to memory issue).

Thanks.

=sumx(
'3_13May',
 if(sumx('3_13May',SWITCH (
        TRUE (),
        '3_13May'[Column1]=EARLIER('3_13May'[Column1]),1,
        '3_13May'[Column2]=EARLIER('3_13May'[Column1]),1,0))=1,1,0
))
6 REPLIES 6
Mariusz
Community Champion
Community Champion

Hi @theo 

Can you send some data sample?

 

Best Regards,
Mariusz

Please feel free to connect with me.
Mariusz Repczynski

 

Hi @Mariusz ;

Here's a representative sample  and some manual calculations I did.

Capture.JPG

Mariusz
Community Champion
Community Champion

Hi @theo 

 

You can achieve this by applying three steps in Query editor.

Please see the M code below based on the example that you provided.

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("dY+xDcQwDAN3cZ0iEpnEmSXw/mu8zm4ej3xBwQB5lPw8LdrWsqSSS0fpbGN7d67p9HrdpdgZ5IJg6J8LHsdqpSZOxsUgHcRzf/NzXkF7rvbknKQwSSbJnMn+69MqjhAlyukLX/iCF7zg1b98UIMa1Oz32m9QgxrUoOYTvtsYHw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t, Column4 = _t, Column5 = _t, Column6 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", Int64.Type}, {"Column2", Int64.Type}, {"Column3", Int64.Type}, {"Column4", Int64.Type}, {"Column5", Int64.Type}, {"Column6", Int64.Type}}),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Changed Type", {}, "Attribute", "Value"),
    #"Grouped Rows" = Table.Group(#"Unpivoted Columns", {"Value"}, {{"Count", each Table.RowCount(_), type number}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Count] <> 1))
in
    #"Filtered Rows"
Best Regards,
Mariusz

If this post helps, then please consider Accepting it as the solution.

Please feel free to connect with me.
Mariusz Repczynski

 

hi @Mariusz Thanks but it doesnt provide the result that I need.

The code you provided just count the duplicate of all entry .

What I need is to count the rows that have duplicate (eg. 5 duplicates) based on the current row.

in my example under dup 5 count, line 1 has a count of 1 since there is one row that has 5 similar entry with line 1 by virtue of 1,2,3,4, and 5.  same goes with line 2 while the rest has 0 meaning no row in the table that has 5 similar entry.

Mariusz
Community Champion
Community Champion

Hi @theo 

Please see the below three tables, I believe that this will do whats required however I do not think it will perform very well with 12 million rows.

// yourSampleTable
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("dY+xDcQwDAN3cZ0iEpnEmSXw/mu8zm4ej3xBwQB5lPw8LdrWsqSSS0fpbGN7d67p9HrdpdgZ5IJg6J8LHsdqpSZOxsUgHcRzf/NzXkF7rvbknKQwSSbJnMn+69MqjhAlyukLX/iCF7zg1b98UIMa1Oz32m9QgxrUoOYTvtsYHw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t, Column4 = _t, Column5 = _t, Column6 = _t]),
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
    #"Reordered Columns" = Table.ReorderColumns(#"Added Index",{"Index", "Column1", "Column2", "Column3", "Column4", "Column5", "Column6"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Reordered Columns",{{"Index", Int64.Type}})
in
    #"Changed Type"
// duplicateCountTable
let
    Source = yourSampleTable,
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(Source, {"Index"}, "Column", "Value"),
    #"Merged Queries" = Table.NestedJoin(#"Unpivoted Columns", {"Value"}, #"Unpivoted Columns", {"Value"}, "Unpivoted Columns", JoinKind.Inner),
    #"Expanded Unpivoted Columns" = Table.ExpandTableColumn(#"Merged Queries", "Unpivoted Columns", {"Index"}, {"Index.1"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Unpivoted Columns", each ([Index.1] <> [Index])),
    #"Grouped Rows" = Table.Group(#"Filtered Rows", {"Index", "Index.1"}, {{"Count", each "Dup " & Number.ToText( List.Count(_) ) & " Count", type text }}),
    #"Sorted Rows" = Table.Sort(#"Grouped Rows",{{"Count", Order.Descending}}),
    #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Sorted Rows", {{"Count", type text}}, "en-GB"), List.Distinct(Table.TransformColumnTypes(#"Sorted Rows", {{"Count", type text}}, "en-GB")[Count]), "Count", "Index.1", List.Count)
in
    #"Pivoted Column"

 

// mergeOutput
let
    Source = Table.NestedJoin(yourSampleTable, {"Index"}, duplicateCountTable, {"Index"}, "duplicateCountTable", JoinKind.LeftOuter),
    #"Expanded duplicateCountTable" = Table.ExpandTableColumn(Source, "duplicateCountTable", {"Dup 5 Count", "Dup 4 Count", "Dup 3 Count", "Dup 2 Count", "Dup 1 Count"}, {"Dup 5 Count", "Dup 4 Count", "Dup 3 Count", "Dup 2 Count", "Dup 1 Count"})
in
    #"Expanded duplicateCountTable"

 

 

Best Regards,
Mariusz

If this post helps, then please consider Accepting it as the solution.

Please feel free to connect with me.
Mariusz Repczynski

 

Hi @Mariusz Thanks.  Will try this one.   That is also my concern for using power query, it may not be able to work on the 12m rows.  So Im thibking if measure maybe a better alternative.

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.