Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

The ultimate Microsoft Fabric, Power BI, Azure AI & SQL learning event! Join us in Las Vegas from March 26-28, 2024. Use code MSCUST for a $100 discount. Register Now

Greg_Deckler

Kaplan Meier Survival Curves with Power BI - Part 1

Introduction

Inspired by a post on the Power BI Community site, I read through this fine article by Chris Short on LinkedIn Pulse, Survival Curves in Tableau with HR Data. Overall, this technique looked really interesting and I wondered if I could do something similar in Power BI. Turns out that you can and this article will show you how.

 

I will not reiterate the history of Kaplan Meier Survival Curves. If you just have to know, Chris Short's article referenced above does a great job of summarizing their history and use or you could read this Wikipedia article.

 

Building the Curve

Basically, what we want to do is to use some HR data with a hire date and termination date to build a Survival Curve that describes the time between when employees are hired and terminated.

 

Step 1: Get the Data

Turns out that the data for this is incredibly simple, essentially a single table. I used the same exact data that Chris Short used in his article. The data can be downloaded here. I simply saved the data to an Excel file. The data looks something like this:

km2.png

 

Step 2: Load into Power BI and Create Calculated Fields

This next step starts very similar to the Tableau process, once you have the Excel file, you can load it into Power BI Desktop by choosing Home | Get Data | Excel from the ribbon. Point to Sheet1 and Load the data. Rename the table to "HRData" and then create the following three calculated columns:

 

Count = 1

Days = IF([Status]="Active",DATEDIFF(HRData[Hire Date],TODAY(),DAY),DATEDIFF(HRData[Hire Date],HRData[Termination Date],DAY))
Event = IF(HRData[Status] = "Inactive",1,0)

Count: This is a simple counter column, sure we could use the COUNT or COUNTROWS aggregation instead, but this makes things a little more obvious.

Days: This is the "time to event" calculation. In this case, it is the tenure of the employee in days. Unlike the Tableau method, we do not convert this to years but leave it as days.

Event: As in the Tableau method, we define our event as a person becoming inactive.

 

Step 3: Do Some Data Modeling

Because Power BI works a little differently than Tableau, we need to do some additional data modeling. The Tableau technique relies on a handy little function called PREVIOUS_VALUE, which is really nifty. But we don't really have an equivalent for PREVIOUS_VALUE when working with Power BI Measures so we will need to do some additional work. The reason we need this is that the Kaplan Meier equation is essentially a running product, which is kind of like a running total except that you are doing multiplication instead of addition.

 

We start by creating a New Table using the following formula:

 

KMDaysAll = GENERATESERIES(0,MAX(HRData[Days]),1)

This creates a table with a single Value column that goes from 0 to the maximum Days in the HRData table in increments of 1. Rename the Value column to Days. Switch to the Relationships tab and build a relationship between the Days columns in HRData and KMDaysAll.

 

Now, in KMDaysAll, create the following calculated columns:

 

Count = var records = CALCULATE(SUM(HRData[Count]),RELATEDTABLE(HRData)) RETURN IF(ISBLANK(records),0,records)

e_i = var records = CALCULATE(SUM(HRData[Event]),RELATEDTABLE(HRData)) RETURN IF(ISBLANK(records),0,records)
Running = SUMX(FILTER(ALL(KMDaysAll),[Days]<=EARLIER([Days])),KMDaysAll[Count])
d_i = [Count] + CALCULATE(SUM([Count]),ALL(KMDaysAll)) - [Running]
1-e_i/d_i = 1-[e_i]/[d_i]

OK, these deserve some explanation:

 

Count: This just gets the number of HRData records that correspond to each day.

e_i: This represents the number of events that occur over time. This is the numerator in our Kaplan Meier's calculation.

Running: This is a running total of the Count of HRData records

d_i: This will be the denominator for the Kaplan Meier calculation, which represents the total population of people since the last time period (ti – 1)

1-e_i/d_i: This is the Kaplan Meier curve calculation that we will need to perform a running product on.

 

So, we now create a measure for the calculation our Kaplan Meier curve. This is the Survival Curve, or specifically the probability of survival at ti – which is 1 – the hazard function (probability of not surviving). We do this with a measure as follows:

 

MyKMDaysAll = CALCULATE(PRODUCT('KMDaysAll'[1-e_i/d_i]), FILTER(ALLSELECTED('KMDaysAll'[Days]),ISONORAFTER('KMDaysAll'[Days], MAX('KMDaysAll'[Days]), DESC)))

Basically, this is your standard running total calculation when you use a Running Total Quick Measure except that we use PRODUCT as our aggregation. We are using a Running Product essentially in order to get around not having a PREVIOUS_VALUE function.

 

Step 4: Build the Curve

Now you can just choose a Line chart visual and put Days from the KMDaysAll table in the x-axis and the MyKMDaysAll measure in the y-axis. A little formatting and you should get something like the image below:

 

km3.png

 

Conclusion

This is Part 1. Stay tuned for Part 2 to see how to separate out and view individual Departments!

Comments

Add tracking counter.

 

 

 

Hi Greg.

 

I am using you example as base of some work I am trying to get done. The steps are:

Create new table with all of the periods like you have it in your example. (1-60)

KMMembersAll2 = GENERATESERIES(1,MAX(MemberSurv[SURVIVALLENGTH]),1)

 

Create these measures one at a time:

MEASURE KMMembersAll2[FailureCount2] = SUMX(MemberSurv, MemberSurv[SURVIVALEVENT])

MEASURE KMMembersAll2[CensoredCount2] = SUMX(MemberSurv, MemberSurv[CensoredEvent])

MEASURE KMMembersAll2[Members running total in Period2] = CALCULATE(
SUMX(MemberSurv, 'MemberSurv'[Count]),
FILTER(
ALL('KMMembersAll2'),
ISONORAFTER('KMMembersAll2'[Period2], MAX('KMMembersAll2'[Period2]), ASC)
)
)


MEASURE KMMembersAll2[HazardRate2] = [FailureCount2]/KMMembersAll2[Members running total in Period2]

MEASURE KMMembersAll2[one-HR2] = (1- [HazardRate2])

 

But I am having real hard time to get the running product of KMMembersAll2[one-HR2] (survival rate) .

One of my attempts is this one. It seems to be doing the running product but not from the smallest to the lartest period.  It is displaying the running product from period 60 to period 1:

MEASURE KMMembersAll2[Surv] =
CALCULATE(
PRODUCTX(
FILTER(
ALL('KMMembersAll2'[Period2]),
ISONORAFTER('KMMembersAll2'[Period2], MAX('KMMembersAll2'[Period2]), ASC)
),
KMMembersAll2[one-HR2]
)
)

 

Any thoughts of what I am doing wrong?

 

I really appreaciate any help.

Thank you.

Nabil

 

 

Anonymous

By following your steps, my curve looks like the below. As I'm spot checking it looks like I'm getting different values as well. Is the data set different than when this post was created? Or am I just doing something wrong...

 

I just copy/pasted the columns given in the tutorial.

 

2018-07-03_15-06-38.png

The table posted was just partial data. Did you download the full data from here:

 

https://www.dropbox.com/s/a8gwpp6cfm9jw0w/Survival%20Curve%20Data.xlsx?dl=0&lipi=urn%3Ali%3Apage%3Ad...

 

Anonymous

Yes I did.

Anonymous

I found the problem! I missed building the connection between the HRData and KMDaysAll tables. Thanks again for your help, it's really appreciated!

Since calculated tables are evaluated on data load time, this approach will not work with filters.

If you want to do this with filters, you'll need to use R.

Here's an example: http://www.ilkkapeltola.fi/2018/10/using-survival-plot-to-analyze-churn-in.html

Hi @Greg_Deckler,

 

I've been working through your tutorial, and first want to say thank you for putting together something so easy to follow and comprehensive in the setup!

 

My question is around the filtering, which has been touched on here by others, as well as yourself, but for which I still wanted to ask. Is there a method to filter the curve or otherwise show curves for multiple fields without Steps 5-7? In my work, we are typically looking at breakouts of 40+ (depending on the field), and so cannot rely on your current solution.

 

Please let me know if I should provide more info before you respond.

 

 

Regards,

Dave

 

Hi @Greg_Deckler,

 

Is there any way to create a Load Duration Curve?

In the y-axis I would have the power in kW and in the x-axis I would have the duration (over a year).

The data I have is energy in kWh every 15 minutes.

 

Thank you for your time.

Best regards,

Eduard Melo

Anonymous

 

 

Tx, I have made a survival curve with my data!
Any suggestions on how the curve can be adapted by filter?

Michiel
KMSurvival.jpg

mthomeer, you cannot with DAX operations.

 

You need R for it.

 

Here's a way to do it in R:

https://www.ilkkapeltola.fi/2018/10/using-survival-plot-to-analyze-churn-in.html

Was the first time I used R, but made it.

 

After installing R-script, just put the following in the R-script editor:

library(survival)
library(survminer)
library(lubridate)

plot(survfit(Surv(dataset$Days,dataset$Event)~1 , data = dataset))

@Greg_Deckler bringing this one back fromt the dead.

 

I'm trying to do this same analysis but with slightly different data and I'm having a bit of trouble figuring out how to apply your method to my data.

 

I have a table full of equipment failures. In this table I have the total hours on each piece of equipment when the failure happened, a failure id number, the failure date (which I think is irrelevant) and a serial number for each piece of equipment. The failure ID is unique for each failure with no repeats, but there are instances when a serial number can fail mulitple times, and there can be multiple failures for different serial numbers at the same hour count. Below is a replica of what the table looks like.

 

DTFailureIDEquip_HrsSerial_Num
1/4/202215001654
1/6/2022210006987
1/8/202235009819
1/11/202248009816
1/21/2022518006987

 

Any advice on how to proceed?

Running = SUMX(FILTER(ALL(KMDaysAll),[Days]<=EARLIER([Days])),KMDaysAll[Count])

In here, What should [Days] refer to? HRData[Days]?

Both [Days] and HRData[Days] are not working...   

Insub_0-1676605616275.png

 

 

*** Sorry, my bad. I didn't read what you wrote down. Everything is alright now. Thank you!