raykim
Frequent Visitor

Calculating max information gain over time (List of values as variable)

I have a customer churn dataset with the following columns: customerID, timestamp, churn flag, and spending.

What I hope to achieve is to calculate the maximum information gain over time.

 

To do so, the following steps are required:

1. Define time bins (e.g. 20 days)

2. Define the list of cutoff points (e.g. {$20, $21, $22, $23 ...})

3. Count the number of churners and survivors in each time bin

4. Count the number of churners and survivors with spending at or above each cutoff point in each time bin

5. Count the number of churners and survivors with spending below each cutoff point in each time bin

6. Find the maximum information gain among the values computed for each cutoff point
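
For reference, steps 3 to 6 can be written out as follows. This is only a sketch: the symbols ($n$, $p$, $H$, $c$) are notation assumed here for illustration, and $n$ is taken to be the number of rows in one time bin.

```latex
% p = fraction of churners in a group; H is the binary entropy
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

% Conditional entropy after splitting the bin at cutoff c
% (n_{\ge c}, n_{< c}: row counts at or above / below the cutoff;
%  p_{\ge c}, p_{< c}: churn fractions within each side)
H_c = \frac{n_{\ge c}}{n} H(p_{\ge c}) + \frac{n_{< c}}{n} H(p_{< c})

% Relative information gain for cutoff c, and its maximum over all cutoffs
\mathrm{RIG}(c) = \frac{H(p) - H_c}{H(p)}, \qquad
\mathrm{RIG}_{\max} = \max_{c} \mathrm{RIG}(c)
```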

 

The problem I am facing is that I cannot iterate through multiple cutoff points to test which one gives the highest information gain.

I can calculate the information gain for a single cutoff value, and the code is below.

 

max_RIG = 

VAR cutoff_point = 30

// prior //
VAR total =
countrows(data)

VAR prior_churned =
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), filter(data, data[churn] = TRUE()))

VAR prior_survived = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), filter(data, data[churn] = FALSE()))


// feature is True //

VAR feature_true =
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] >= cutoff_point))

VAR feature_true_churned = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] >= cutoff_point), filter(data, data[churn] = TRUE()))

VAR feature_true_survived = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] >= cutoff_point), filter(data, data[churn] = FALSE()))


// feature is false//

VAR feature_false =
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] < cutoff_point))

VAR feature_false_churned = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] < cutoff_point), filter(data, data[churn] = TRUE()))

VAR feature_false_survived = 
calculate(countrows(data), filter(data, data[timestamp] >= RIG_calc[start_date]), filter(data, data[timestamp] < RIG_calc[end_date]), 
         filter(data, data[spending] < cutoff_point), filter(data, data[churn] = FALSE()))


// RIG //
VAR prior_gain =
prior_churned / total * log(prior_churned / total, 2) + prior_survived / total * log(prior_survived / total, 2)

VAR feature_true_gain =
feature_true/total * (feature_true_churned/feature_true * log(feature_true_churned /  feature_true, 2) + feature_true_survived / feature_true * log(feature_true_survived/feature_true,2))

VAR feature_false_gain =
feature_false/ total * (feature_false_churned/feature_false * log(feature_false_churned/feature_false, 2) + feature_false_survived / feature_false * log(feature_false_survived / feature_false, 2))

RETURN
(prior_gain - (feature_true_gain + feature_false_gain)) / prior_gain

I tried replacing VAR cutoff_point = 30 with

VAR cutoff_point = GENERATESERIES(MIN(data[spending]), MAX(data[spending]), 1)

and applying the MAX function in the last line:

RETURN
MAX((prior_gain - (feature_true_gain + feature_false_gain)) / prior_gain)

But it doesn't work because the MAX() function only takes a column as a parameter.

Any idea how to implement this in Power BI?

 

 

1 ACCEPTED SOLUTION

Hi @raykim ,

 

Please give this a try.

 

Measure = 
VAR a =
    UNION (
        ROW ( "value", 0.23 ),
        ROW ( "value", 0.51 ),
        ROW ( "value", 0.18 ),
        ROW ( "value", 0.79 )
    )
RETURN
    MAXX ( a, [value] )
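
The same MAXX pattern could be extended to the original question by generating one row per cutoff and computing the gain for each. The following is a rough, untested sketch under several assumptions: it is written as a calculated column on RIG_calc, every time bin contains at least one row of data, and the names bin_rows, hi_rows, gains and "rig" are made up for illustration. It also divides by the number of rows in the bin, whereas the code in the question divides by COUNTROWS(data).

```dax
max_RIG =
VAR bin_start = RIG_calc[start_date]
VAR bin_end = RIG_calc[end_date]
// Rows of data that fall inside the current time bin
VAR bin_rows =
    FILTER ( ALL ( data ), data[timestamp] >= bin_start && data[timestamp] < bin_end )
VAR n = COUNTROWS ( bin_rows )
VAR n_churned = COUNTROWS ( FILTER ( bin_rows, data[churn] = TRUE () ) )
// Entropy of the churn flag before splitting (the IF guards avoid LOG of zero)
VAR p = DIVIDE ( n_churned, n )
VAR h_prior =
    -1 * ( IF ( p > 0, p * LOG ( p, 2 ), 0 )
        + IF ( p < 1, ( 1 - p ) * LOG ( 1 - p, 2 ), 0 ) )
// One row per candidate cutoff, each carrying its relative information gain
VAR gains =
    ADDCOLUMNS (
        GENERATESERIES (
            MINX ( bin_rows, data[spending] ),
            MAXX ( bin_rows, data[spending] ),
            1
        ),
        "rig",
            VAR cutoff = [Value]
            VAR hi_rows = FILTER ( bin_rows, data[spending] >= cutoff )
            VAR n_hi = COUNTROWS ( hi_rows )
            VAR n_hi_churned = COUNTROWS ( FILTER ( hi_rows, data[churn] = TRUE () ) )
            // The "below cutoff" side is whatever remains in the bin
            VAR n_lo = n - n_hi
            VAR n_lo_churned = n_churned - n_hi_churned
            VAR p_hi = DIVIDE ( n_hi_churned, n_hi )
            VAR p_lo = DIVIDE ( n_lo_churned, n_lo )
            VAR h_hi =
                -1 * ( IF ( p_hi > 0, p_hi * LOG ( p_hi, 2 ), 0 )
                    + IF ( p_hi < 1, ( 1 - p_hi ) * LOG ( 1 - p_hi, 2 ), 0 ) )
            VAR h_lo =
                -1 * ( IF ( p_lo > 0, p_lo * LOG ( p_lo, 2 ), 0 )
                    + IF ( p_lo < 1, ( 1 - p_lo ) * LOG ( 1 - p_lo, 2 ), 0 ) )
            // Entropy after the split, weighted by the size of each side
            VAR h_cond = DIVIDE ( n_hi, n ) * h_hi + DIVIDE ( n_lo, n ) * h_lo
            RETURN
                DIVIDE ( h_prior - h_cond, h_prior )
    )
RETURN
    MAXX ( gains, [rig] )
```

GENERATESERIES names its column [Value], which is why the cutoff is read from [Value] inside ADDCOLUMNS. Filtering the bin once into bin_rows and reusing it keeps the per-cutoff work to a single extra FILTER, but with many cutoffs and many rows this may still be slow.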


Community Support Team _ Frank
If this post helps, please consider accepting it as the solution to help others find it more quickly.


3 REPLIES
v-frfei-msft
Community Support

Hi @raykim ,

 

MAXX() should work in your scenario.

 

 


MAXX would not work because it takes a table reference as a parameter.

I would like to calculate the max value of an array.

For instance, max of {0.23, 0.51, 0.18, 0.79}
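
For a literal list like this one, the DAX table constructor may be the shortest route, since it turns the braces into a table that MAXX can iterate. A minimal sketch, assuming it is defined as a measure (the name "Max of list" is made up for illustration):

```dax
Max of list =
// { ... } builds a single-column table; that column is named [Value]
MAXX ( { 0.23, 0.51, 0.18, 0.79 }, [Value] )
```

With these four values the measure should return 0.79.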

