BastiaanBrak
Helper IV

Jaccard Index (similarity metric) calculation in 'Power Query'

Hi all, I have a dataset consisting of a table with 33 columns x 30 rows. The values in each cell are text, and I want to calculate the Jaccard Index, a measure of similarity, for each combination of two columns. I can do this manually in Power Query, but for a table with 33 columns this results in 528 comparisons (33 x 32 / 2), so I'm hoping this can be automated somehow.

 

To clarify with a simplified example (see link with attached .pbix at bottom), suppose I have four fruit smoothie recipes, each with five ingredients:

[Image: table of the four smoothie recipes with their five ingredients each]

The Jaccard Index is calculated as the number of values that appear in both sets (intersection), divided by the number of unique values across both sets (union). With 4 recipes there are 6 comparisons: 1&2, 1&3, 1&4, 2&3, 2&4 and 3&4.

 

So for Recipe 1&4 the Jaccard Index is 4 (i.e. Pineapple, Strawberry, Banana and Kiwi) divided by 6 = 0.667, whereas for Recipe 1&2 the Jaccard Index is 1 (i.e. Strawberry) divided by 9 = 0.111.
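For reference, the definition above can be sketched in Python, one of the languages mentioned in this question. This is only a minimal illustration: the function name `jaccard_index` is my own, and the two ingredient lists are hypothetical stand-ins for the recipes in the screenshot. Only the four shared ingredients come from the post; "Mango" and "Orange" are made-up fillers chosen so the union has 6 values.

```python
def jaccard_index(a, b):
    """Jaccard Index: size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets: return 0 rather than dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical recipes: 4 shared ingredients, 1 unique ingredient each
recipe_1 = ["Pineapple", "Strawberry", "Banana", "Kiwi", "Mango"]
recipe_4 = ["Pineapple", "Strawberry", "Banana", "Kiwi", "Orange"]

print(round(jaccard_index(recipe_1, recipe_4), 3))  # 4 shared / 6 distinct -> 0.667
```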

For this demo it is straightforward to calculate the Jaccard Index for each of the 6 combinations in Power Query and store them in a table like this (see link to attached .pbix file at bottom):

[Image: table of Jaccard Indices for the six recipe combinations]

but with my actual dataset consisting of 33 columns, doing this manually is not feasible.

 

Is there a way to automate this in Power Query or should I do something like this in, say, R / Python?
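If the Python route is of interest, automating every column pair could look roughly like the sketch below. It assumes the table has been loaded as a plain dictionary of column name to list of text values; the helper name `pairwise_jaccard` and the three demo columns are hypothetical, not taken from the .pbix file.

```python
from itertools import combinations
from math import comb


def jaccard_index(set_a, set_b):
    """Intersection size divided by union size (0.0 if both sets are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pairwise_jaccard(columns):
    """columns: dict mapping column name -> list of text values.

    Returns a dict mapping each unordered pair of column names to its index.
    """
    # Turn each column into a set, ignoring empty cells
    sets = {
        name: {value for value in values if value not in (None, "")}
        for name, values in columns.items()
    }
    return {
        (left, right): jaccard_index(sets[left], sets[right])
        for left, right in combinations(sets, 2)
    }


# Hypothetical demo data with three columns
table = {
    "A": ["Apple", "Banana", "Kiwi"],
    "B": ["Banana", "Kiwi", "Mango"],
    "C": ["Orange", "Lime", "Lemon"],
}
scores = pairwise_jaccard(table)
print(scores[("A", "B")])  # 2 shared / 4 distinct -> 0.5
print(comb(33, 2))         # number of pairs for 33 columns -> 528
```

`itertools.combinations` yields each unordered pair exactly once, so no separate de-duplication step is needed.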

 

.pbix file: https://file.io/tqm0sQrFR8jS

Many thanks, Bastiaan

2 ACCEPTED SOLUTIONS
parry2k
Super User

@BastiaanBrak see the attached solution. It scales as you add more recipes or ingredients to the model.

 

The solution has two parts:

1st, in Power Query there are two tables called Recipe and Recipe Ingredients, and a function called fnRecipe that does some of the transformation.

2nd, there is a DAX measure called Jaccard Index that does the calculation for each combination. Here is the output:

 

[Image: output of the Jaccard Index measure for each recipe combination]

 

You can tweak the solution as you see fit and optimize the PQ step. If something is not clear, do let me know.

 


Greg_Deckler

@BastiaanBrak and @parry2k I couldn't resist doing this in DAX. This assumes an unpivoted table:

Jaccard = 
    // Distinct list of recipe names
    VAR __Table = DISTINCT('Recipe Ingredients'[Recipe])
    // Every ordered pair of recipes, minus self-pairs
    VAR __Table1 = GENERATE(SELECTCOLUMNS(__Table,"Recipe 1",[Recipe]),SELECTCOLUMNS(__Table,"Recipe 2",[Recipe]))
    VAR __Table2 = FILTER(__Table1,[Recipe 1] <> [Recipe 2])
    // Last character of each name; RIGHT(...,1) only distinguishes single-character suffixes (recipes 1-9)
    VAR __Table3 = ADDCOLUMNS(__Table2,"1",RIGHT([Recipe 2],1),"2",RIGHT([Recipe 1],1))
    // Keep each unordered pair once
    VAR __Table4 = FILTER(__Table3,[1]<[2])
    VAR __Table5 = ADDCOLUMNS(__Table4,"Recipe","Recipe " & [1] & "&" & [2])
    // "Total" = number of distinct ingredients across both recipes (union)
    VAR __Table6 = ADDCOLUMNS(__Table5,"Total",
        COUNTROWS(
            DISTINCT(
                UNION(
                    SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 1]),"Ingredient",[Ingredient]),
                    SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 2]),"Ingredient",[Ingredient])
                )
            )
        )
    )
    // "Same" = number of ingredients present in both recipes (intersection)
    VAR __Table7 = ADDCOLUMNS(__Table6,"Same",
        COUNTROWS(
            INTERSECT(
                SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 1]),"Ingredient",[Ingredient]),
                SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 2]),"Ingredient",[Ingredient])
            )
        )
    )
    // Jaccard Index = Same / Total, returning 0 when Total is 0
    VAR __Table8 = ADDCOLUMNS(__Table7,"Jaccard Index",DIVIDE([Same],[Total],0))
RETURN
    SELECTCOLUMNS(__Table8,"Recipe",[Recipe],"Jaccard Index",[Jaccard Index])
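
For anyone comparing this against the Python route raised in the question, the same steps on an unpivoted table might be sketched as follows. This is a rough analogue, not a translation of the DAX: the function name `jaccard_from_long_rows`, the "A & B" label format, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_from_long_rows(rows):
    """rows: iterable of (recipe, ingredient) pairs, i.e. an unpivoted table."""
    # Group the ingredients of each recipe into a set
    ingredients = defaultdict(set)
    for recipe, ingredient in rows:
        ingredients[recipe].add(ingredient)

    result = {}
    # Each unordered pair of recipes once, in sorted name order
    for left, right in combinations(sorted(ingredients), 2):
        same = len(ingredients[left] & ingredients[right])   # intersection
        total = len(ingredients[left] | ingredients[right])  # union
        result[f"{left} & {right}"] = same / total if total else 0.0
    return result


# Hypothetical unpivoted data
rows = [
    ("Recipe 1", "Banana"), ("Recipe 1", "Kiwi"),
    ("Recipe 2", "Banana"), ("Recipe 2", "Mango"),
    ("Recipe 3", "Kiwi"),
]
scores = jaccard_from_long_rows(rows)
print(scores["Recipe 1 & Recipe 3"])  # 1 shared / 2 distinct -> 0.5
```

Because the pairs are built from the full recipe names rather than their last character, this version does not depend on how many recipes there are.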

 



5 REPLIES
BastiaanBrak
Helper IV

@parry2k @Greg_Deckler Thanks so much, both of you!! I'm not sure if I can accept both replies as solutions but will try.

parry2k
Super User

@Greg_Deckler Looks good. You are creating a calculated table, which makes sense. Thanks for sharing.

 

The only point I want to add: if users want to extend the functionality to slice and dice the data, a calculated table will not work, because it is evaluated at refresh time and does not respond to slicers.




Greg_Deckler

@parry2k Yeah, I went with a calculated table, but in theory you could do the same thing with a measure and the code would be pretty similar overall. You would have to assume two independent slicers, etc. All you would need then is the COUNTROWS from __Table6 and the COUNTROWS from __Table7.

 

Nice Power Query code BTW!

