Solved: Re: Calculating max information gain over time (Li...

raykim · ‎07-22-2019

I have customer churn dataset with the following columns: cutomerID, timestamp, churn flag, and spending.

What I hope to acheive is to calculate max information gain over time.

In order to do so, the following steps are required

1. Define time bins (i.e. 20 days)

2. Define the list of cutoff points (i.e. {$20, $21, $22, $23 ...})

3. Find and count the number of churners and survivors in each time bin

4. Find and count the numer of chuners and survivors with spending more than cutoff points in each time bin

5. Find and count the numer of chuners and survivors with spending less than cutoff points in each time bin

6. Find the max information gain among the information values from each cutoff points

And the problem I am facing is that I cannot iterate through multiple cutoff points and test out which gives the highest information gain value.

I can calculate the information gain for a sing cutoff value and the code is as below.

max_RIG = 

VAR cutoff_point = 30

// prior //
VAR total =
countrows(data)

VAR prior_churned =
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), filter(data, data[churn] = TRUE()))

VAR prior_survived = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), filter(data, data[churn] = FALSE()))


// feature is True //

VAR feature_true =
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] >= cutoff_point))

VAR feature_true_churned = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] >= cutoff_point), filter(data, data[churn] = TRUE()))

VAR feature_true_survived = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] >= cutoff_point), filter(data, data[churn] = FALSE()))


// feature is false//

VAR feature_false =
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] < cutoff_point))

VAR feature_false_churned = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] < cutoff_point), filter(data, data[churn] = TRUE()))

VAR feature_false_survived = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] < cutoff_point), filter(data, data[churn] = FALSE()))


// RIG //
VAR prior_gain =
prior_churned / total * log(prior_churned / total, 2) + prior_survived / total * log(prior_survived / total, 2)

VAR feature_true_gain =
feature_true/total * (feature_true_churned/feature_true * log(feature_true_churned /  feature_true, 2) + feature_true_survived / feature_true * log(feature_true_survived/feature_true,2))

VAR feature_false_gain =
feature_false/ total * (feature_false_churned/feature_false * log(feature_false_churned/feature_false, 2) + feature_false_survived / feature_false * log(feature_false_survived / feature_false, 2))

RETURN
(prior_gain - (feature_true_gain + feature_false_gain)) / prior_gain

I treid to replace VAR cutoff_point = 30 to

VAR cutoff_point = GENERATESERIES(MIN(data[spending], MAX(data[spending]),1)

and apply max function in the last line

RETURN
MAX(prior_gain - (feature_true_gain + feature_false_gain)) / prior_gain)

But it doesn't work because MAX() function only takes column as a parameter.

Any idea on how to implement this on Power BI?

v-frfei-msft · ‎07-25-2019

Hi @raykim ,

Have a try please.

Measure = 
VAR a =
    UNION (
        ROW ( "value", 0.23 ),
        ROW ( "value", 0.51 ),
        ROW ( "value", 0.18 ),
        ROW ( "value", 0.79 )
    )
RETURN
    MAXX ( a, [value] )

Community Support Team _ Frank
If this post helps, then please consider Accept it as the solution to help the others find it more quickly.

View solution in original post

v-frfei-msft · ‎07-23-2019

Hi @raykim ,

MAXX() should work in your scenario.

Community Support Team _ Frank
If this post helps, then please consider Accept it as the solution to help the others find it more quickly.

raykim · ‎07-23-2019

MAXX would not work because it takes table reference as a parameter.

I would like to calculate the max value of array.

For instance, max of {0.23, 0.51, 0.18, 0.79}

v-frfei-msft · ‎07-25-2019

Hi @raykim ,

Have a try please.

Measure = 
VAR a =
    UNION (
        ROW ( "value", 0.23 ),
        ROW ( "value", 0.51 ),
        ROW ( "value", 0.18 ),
        ROW ( "value", 0.79 )
    )
RETURN
    MAXX ( a, [value] )

Community Support Team _ Frank
If this post helps, then please consider Accept it as the solution to help the others find it more quickly.

Calculating max information gain over time (List of values as variable)

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly