Solved: Re: Hypergeometric or binomial probability density...

Anonymous · ‎04-28-2020

In Excel there are easy functions for this, like HYPGEOM.DIST and BINOM.DIST, do these exist in DAX in some form?

I tried creating the formulas myself, but because I'm working with rather large numbers the required factorials can't be computed either and result in #num errors. I read up on approximating them with a log gamma function like Excel's GAMMALN, but again this also doesn't exist in DAX. I looked at how accurate POISSON would approximate the results, and it works fairly okay'ish, but there are rather significant differences in the extreme tail ends, which is exactly what I'm interested in.

I basically have a database of companies and characteristics, and I'm trying to figure out the statistical significance of some of those characteristics for whether a company closed down or not, depending on variables such as industry etc. I can work fine with the database in Power Query and DAX, but I'd love to be able to just use one function to calculate probability density instead of having to report everything in Excel pivot tables and then referencing them with the traditional probability density functions.

Greg_Deckler · ‎04-29-2020

OK, the hypergeometric distribution formula I have in my book is hard to write here because there is no formula editor but it goes like this:

P(X=k) = (K k)( N-K n-k) / (N n)

So, (K k) combinations (N-k n-k) combinations, (N n) combinations. So, I assume you have measures or variables for K, k, N and n. The recipe basically goes like below, it is very well explained in the book exactly what is going on:

Probability = 
    VAR __Error =
        IF(ISBLANK([K Value]) || 
            ISBLANK([k Value 2]) || 
                ISBLANK([N Value]) || 
                    ISBLANK([n Value 2]) || 
                        [n Value 2]<[k Value 2] || 
                            [K Value]<[k Value 2] || 
                                [N Value]<[n Value 2] || 
                                    [N Value]<[K Value],
            TRUE(),
            FALSE()
        )
    VAR __Numerator = 
        IF(
            ISERROR(
                COMBIN([K Value],[k Value 2]) *
                    COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
            ),
            -1,
            COMBIN([K Value],[k Value 2]) *
                COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
        )
    VAR __Demoninator = 
        IF(
            ISERROR(COMBIN([N Value],[n Value 2])),
            -1,
            COMBIN([N Value],[n Value 2])
        )
RETURN
    IF(
        __Error || __Demoninator = -1 || __Numerator = -1,
        "Bad Parameters",
        DIVIDE(__Numerator,__Demoninator,0)
    )

There is a LOT of error checking going on here because you can generate a ton of errors caculating out the probability. Otherwise, the basic formula is pretty simple because of the COMBIN DAX function.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

View solution in original post

Greg_Deckler · ‎04-28-2020

Ha! And my publisher said that they didn't think that the hypergeometric mean formula had much practical use. But yes, I won that argument and it is in my book, DAX Cookbook. If you can share some sample data I can adapt the recipe. But no, there is no "easy" button for DAX I had to invent the formula. It's not terrible to implement. Be sure to @ me because otherwise I may not see your response.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

Anonymous · ‎04-29-2020

@Greg_Deckler

The actual database is a bit more complex with a bunch of dimension tables and bridge tables to handle many to many relationships, but let's just assume a very basic table, fCompanies:

Company name	Closed down	Industry
ABC		Boondoggles
BCD		Thingamajigs
CDF		Doohickeys
DFG	2008	Boondoggles
FGH	2010	Boondoggles
GHI	2001	Doohickeys
HIJ	2020	Boondoggles
IJK	2010	Thingamajigs
JKL		Thingamajigs
KLM		Thingamajigs

Some sample measures:

Company count:=COUNTROWS(fCompanies)
Closed down company count:=CALCULATE(COUNT(fCompanies[Closed down]),ISNUMBER(fCompanies[Closed down]))
% closed down:=DIVIDE([Closed down company count],[Company count])

So in the above you'd have n = 10, k = 5, p = 0.5. In the case of the Boondoggles industry in specific, you'd have 3 out of 4 companies closed down. The measure I was trying to make was, in a case like this, a right tailed binomial test to give me the significance of cases where many companies ended up closed.

The actual n I have is around 3000 which became a problem when trying for factorials, but sometimes the sliced distributions I want to compare to the overall distribution still end up very small. If you happen to have a binomial solution I'd prefer that, but hypergeometric would also be welcome. Thanks!

Greg_Deckler · ‎04-29-2020

OK, the hypergeometric distribution formula I have in my book is hard to write here because there is no formula editor but it goes like this:

P(X=k) = (K k)( N-K n-k) / (N n)

So, (K k) combinations (N-k n-k) combinations, (N n) combinations. So, I assume you have measures or variables for K, k, N and n. The recipe basically goes like below, it is very well explained in the book exactly what is going on:

Probability = 
    VAR __Error =
        IF(ISBLANK([K Value]) || 
            ISBLANK([k Value 2]) || 
                ISBLANK([N Value]) || 
                    ISBLANK([n Value 2]) || 
                        [n Value 2]<[k Value 2] || 
                            [K Value]<[k Value 2] || 
                                [N Value]<[n Value 2] || 
                                    [N Value]<[K Value],
            TRUE(),
            FALSE()
        )
    VAR __Numerator = 
        IF(
            ISERROR(
                COMBIN([K Value],[k Value 2]) *
                    COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
            ),
            -1,
            COMBIN([K Value],[k Value 2]) *
                COMBIN([N Value]-[K Value],[n Value 2]-[k Value 2])
        )
    VAR __Demoninator = 
        IF(
            ISERROR(COMBIN([N Value],[n Value 2])),
            -1,
            COMBIN([N Value],[n Value 2])
        )
RETURN
    IF(
        __Error || __Demoninator = -1 || __Numerator = -1,
        "Bad Parameters",
        DIVIDE(__Numerator,__Demoninator,0)
    )

There is a LOT of error checking going on here because you can generate a ton of errors caculating out the probability. Otherwise, the basic formula is pretty simple because of the COMBIN DAX function.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

RobG123 · ‎10-10-2023

That's great but I think I need the cumulative hypergeometric distribution to do a Fisher exact test. I can get this in excel from functions. Do you know how to produce that in PowerBI?

Also in your code presumably your notation K and k value 2 is equivalent to K and k (small k)?

Thanks
Rob

Anonymous · ‎04-29-2020

@Greg_Deckler

Thanks! That wasn't entirely what I was struggling with but it does help. One of my problems is large K and N values, so COMBIN can't compute them.

But I've also figured out a statistics solution to my large K problem. It's possible to calculate z-scores without involving the binomial coeffecient that requires factorials or COMBIN, and then turn those z-scores into p-values through NORM.S.DIST. I still need to figure out how to translate that into DAX, but it should theoretically work.

Greg_Deckler · ‎04-30-2020

Funny, @Anonymous , the very next recipe in that chapter is called "Determining the required sample size" and it deals with zscores. You may find this file handy (attached below):

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

Hypergeometric or binomial probability density DAX functions?

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024