Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Anonymous
Not applicable

Hypergeometric or binomial probability density DAX functions?

In Excel there are easy functions for this, like HYPGEOM.DIST and BINOM.DIST, do these exist in DAX in some form?

 

I tried creating the formulas myself, but because I'm working with rather large numbers the required factorials can't be computed either and result in #num errors. I read up on approximating them with a log gamma function like Excel's GAMMALN, but again this also doesn't exist in DAX. I looked at how accurate POISSON would approximate the results, and it works fairly okay'ish, but there are rather significant differences in the extreme tail ends, which is exactly what I'm interested in.

 

I basically have a database of companies and characteristics, and I'm trying to figure out the statistical significance of some of those characteristics for whether a company closed down or not, depending on variables such as industry etc. I can work fine with the database in Power Query and DAX, but I'd love to be able to just use one function to calculate probability density instead of having to report everything in Excel pivot tables and then referencing them with the traditional probability density functions.

1 ACCEPTED SOLUTION

OK, the hypergeometric distribution formula I have in my book is hard to write here because there is no formula editor but it goes like this:

 

P(X=k) = (K k)( N-K n-k) / (N n)

 

So, (K k) combinations (N-k n-k) combinations, (N n) combinations. So, I assume you have measures or variables for K, k, N and n. The recipe basically goes like below, it is very well explained in the book exactly what is going on:

 

Probability = 
    VAR __Error =
        IF(ISBLANK([K Value]) || 
            ISBLANK([k Value 2]) || 
                ISBLANK([N Value]) || 
                    ISBLANK([n Value 2]) || 
                        [n Value 2]<[k Value 2] || 
                            [K Value]<[k Value 2] || 
                                [N Value]<[n Value 2] || 
                                    [N Value]<[K Value],
            TRUE(),
            FALSE()
        )
    VAR __Numerator = 
        IF(
            ISERROR(
                COMBIN([K Value],[k Value 2]) *
                    COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
            ),
            -1,
            COMBIN([K Value],[k Value 2]) *
                COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
        )
    VAR __Demoninator = 
        IF(
            ISERROR(COMBIN([N Value],[n Value 2])),
            -1,
            COMBIN([N Value],[n Value 2])
        )
RETURN
    IF(
        __Error || __Demoninator = -1 || __Numerator = -1,
        "Bad Parameters",
        DIVIDE(__Numerator,__Demoninator,0)
    )

 

There is a LOT of error checking going on here because you can generate a ton of errors caculating out the probability. Otherwise, the basic formula is pretty simple because of the COMBIN DAX function.

 

 

 

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

View solution in original post

6 REPLIES 6
Greg_Deckler
Super User
Super User

Ha! And my publisher said that they didn't think that the hypergeometric mean formula had much practical use. But yes, I won that argument and it is in my book, DAX Cookbook. If you can share some sample data I can adapt the recipe. But no, there is no "easy" button for DAX I had to invent the formula. It's not terrible to implement. Be sure to @ me because otherwise I may not see your response.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...
Anonymous
Not applicable

@Greg_Deckler 

The actual database is a bit more complex with a bunch of dimension tables and bridge tables to handle many to many relationships, but let's just assume a very basic table, fCompanies:

 

Company nameClosed downIndustry
ABC Boondoggles
BCD Thingamajigs
CDF Doohickeys
DFG2008Boondoggles
FGH2010Boondoggles
GHI2001Doohickeys
HIJ2020Boondoggles
IJK2010Thingamajigs
JKL Thingamajigs
KLM Thingamajigs

 

Some sample measures:

Company count:=COUNTROWS(fCompanies)
Closed down company count:=CALCULATE(COUNT(fCompanies[Closed down]),ISNUMBER(fCompanies[Closed down]))
% closed down:=DIVIDE([Closed down company count],[Company count])

 

So in the above you'd have n = 10, k = 5, p = 0.5. In the case of the Boondoggles industry in specific, you'd have 3 out of 4 companies closed down. The measure I was trying to make was, in a case like this, a right tailed binomial test to give me the significance of cases where many companies ended up closed.

 

The actual n I have is around 3000 which became a problem when trying for factorials, but sometimes the sliced distributions I want to compare to the overall distribution still end up very small. If you happen to have a binomial solution I'd prefer that, but hypergeometric would also be welcome. Thanks!

OK, the hypergeometric distribution formula I have in my book is hard to write here because there is no formula editor but it goes like this:

 

P(X=k) = (K k)( N-K n-k) / (N n)

 

So, (K k) combinations (N-k n-k) combinations, (N n) combinations. So, I assume you have measures or variables for K, k, N and n. The recipe basically goes like below, it is very well explained in the book exactly what is going on:

 

Probability = 
    VAR __Error =
        IF(ISBLANK([K Value]) || 
            ISBLANK([k Value 2]) || 
                ISBLANK([N Value]) || 
                    ISBLANK([n Value 2]) || 
                        [n Value 2]<[k Value 2] || 
                            [K Value]<[k Value 2] || 
                                [N Value]<[n Value 2] || 
                                    [N Value]<[K Value],
            TRUE(),
            FALSE()
        )
    VAR __Numerator = 
        IF(
            ISERROR(
                COMBIN([K Value],[k Value 2]) *
                    COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
            ),
            -1,
            COMBIN([K Value],[k Value 2]) *
                COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
        )
    VAR __Demoninator = 
        IF(
            ISERROR(COMBIN([N Value],[n Value 2])),
            -1,
            COMBIN([N Value],[n Value 2])
        )
RETURN
    IF(
        __Error || __Demoninator = -1 || __Numerator = -1,
        "Bad Parameters",
        DIVIDE(__Numerator,__Demoninator,0)
    )

 

There is a LOT of error checking going on here because you can generate a ton of errors caculating out the probability. Otherwise, the basic formula is pretty simple because of the COMBIN DAX function.

 

 

 

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

That's great but I think I need the cumulative hypergeometric distribution to do a Fisher exact test. I can get this in excel from functions. Do you know how to produce that in PowerBI?

 

Also in your code presumably your notation K and k value 2 is equivalent to K and k (small k)?

 

Thanks
Rob

 

Anonymous
Not applicable

@Greg_Deckler 

Thanks! That wasn't entirely what I was struggling with but it does help. One of my problems is large K and N values, so COMBIN can't compute them.

 

But I've also figured out a statistics solution to my large K problem. It's possible to calculate z-scores without involving the binomial coeffecient that requires factorials or COMBIN, and then turn those z-scores into p-values through NORM.S.DIST. I still need to figure out how to translate that into DAX, but it should theoretically work. 

Funny, @Anonymous , the very next recipe in that chapter is called "Determining the required sample size" and it deals with zscores. You may find this file handy (attached below):

 

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors