Mean Time Between Failure (MTBF) and Power BI

Greg_Deckler · ‎01-16-2018

Introduction

Mean Time Between Failure (MTBF) is a common term and concept used in equipment and plant maintenance contexts. In addition, MTBF is an important consideration in the development of products. MTBF, along with other maintenance, repair and reliability information, can be extremely valuable to organizations to help identify problematic systems, predict system outages, improve product designs and improve overall operational efficiency and effectiveness. As such, it would be beneficial to understand how to implement this metric within Power BI.

Background

MTBF is essentially the arithmetic mean time between failures of a system. Now, the definition of a failure will depend greatly on the circumstance in which this metric is applied but in general, failures that do not take the system out of service are generally not considered failures. In addition, routine maintenance tasks that are scheduled are also generally not considered failures.

The below image from this Wikipedia Page probably does about as good a job as anything in depicting MTBF.

In formal math language, MTBF is defined as:

Let's Build It!

OK, enough history and background, let's build this puppy!

Step 1: Prepare Your Data

So, I generated some fake data that you can download here. The data has a little over 4,000 rows and represents 3 years of repair data for about a dozen "machines". This data is pretty bare bones but has the essential elements that we will need to demonstrate the technique. Each row represents a single repair. Here is what the data looks like:

So, we essentially have an identifier for the machine, when the repair started, when the repair ended, the type of repair (a repair or preventative maintenance (PM)) and the cause of the repair.

Step 2: Load into Power BI and Create Calculated Fields

This next step is easy, just fire up Power BI Desktop and from the Home tab of the ribbon, choose Get Data | Excel. Point to the data file you just downloaded (above) and choose either the Sheet (MTBF) or the Table (MTBF1) displayed in the navigation and choose Load.

Next, we are going to create two calculated columns. The first one is how long the repair took to complete. The formula for this is:

Repair Hours = DATEDIFF(Repairs[RepairStarted], Repairs[RepairCompleted],SECOND)/3600

Once you create this column, on the Modeling tab of the ribbon, make sure that the Data type is set to Decimal Number and that you specify at least 2 decimal places.

Obviously, all we are doing is subtracting the time the repair started from the time the repair completed so that we end up with how long the repair took to complete. Perhaps less obvious is the reason we specified SECOND instead of HOUR in our DATEDIFF function even though we want this metric in hours and thus the division by 3600 seconds/hour. The reason is that if we had specified HOUR, then DATEDIFF would have truncated the calculation to whole hours and we would have lost fractions of an hour.

The second calculated column is a bit more complex:

Uptime =

VAR next = MINX(FILTER(Repairs,

Repairs[MachineName]=EARLIER(Repairs[MachineName]) &&

Repairs[RepairStarted]>EARLIER(Repairs[RepairStarted]) &&

Repairs[RepairType]<>"PM"

),Repairs[RepairStarted])

RETURN IF([RepairType]="PM", 0,IF(ISBLANK(next),

DATEDIFF([RepairCompleted],NOW(),SECOND),

DATEDIFF([RepairCompleted],next,SECOND)

)

OK, let me explain what is going on here. Remember back to our definition of MTBF, we need to find the uptime between a machine going into an "up" state and that same machine going into a "down" state. But our data isn't structured that way, the data is focused on collecting repair information.

Thus, the first part of the formula, the VAR portion is the calculation of the variable next. This is about finding the next repair after the current repair. We do this by finding the MIN of the RepairStarted column after filtering our Repair table for machines that are equal to the current machine in the row as well as having a RepairStarted that is after the current repair in the row and we are excluding "PM" rows as those are preventative maintenance and not an actual failure.

The second part of the formula, the RETURN portion is really three pieces. If the current row is a maintenance task (PM), then we return 0 for uptime as we do not want those tasks included in our MTBF calculation. If our calculation for next is BLANK, then we know that this is the most recent failure in our data set. So, we return the difference in seconds between NOW and the completion of the current failure. Finally, if neither of those are the case, then we simply calculate the difference in seconds between when our next repair started and the current repair ended.

Step 3: Create some Measures

OK, now that we have our data loaded and have created some extra calculated columns, we can now create the measures that we will use in our report. Create the following measures:

Repairs = CALCULATE(COUNTROWS(Repairs),FILTER(Repairs,[RepairType]<>"PM"))
MTBF (Hours) = DIVIDE(SUM(Repairs[Uptime]),[Repairs],BLANK())/3600
MDT (Hours) = SUM(Repairs[Repair Hours])/COUNTROWS(Repairs)
Last Repair = MAX([RepairCompleted])
Next Expected Repair = [Last Repair] + [MTBF (Hours)]/24

Repairs: This is the total number of repairs that have occurred excluding preventative maintenance (PM) repairs.

MTBF: This is our MTBF calculation that takes the sum of our uptime in seconds, divides by the number of repairs to get an average and then converts the number to hours by dividing by 3600 seconds/hour.

MDT: This is a measure that is often associated with MTBF, MDT is Mean Down Time or the average amount of time that a repair takes to complete. Therefore, we take the sum of our Repair Hours (which does not exclude preventative maintenance tasks) and divides it by the count of all repairs. Note, we do not use the Repairs measure in this case since we want to include PM repairs as well.

Last Repair: This is simply the date/time of our last repair

Next Expected Repair: By knowing when our last repair occurred, we can add our MTBF calculation converted to days (/24) to determine when we expect our next failure to occur.

Step 4: Build the Report

Now that we have all our measures created, it is mere child's play to create a report similar to the following:

I have left everything default so that the visuals themselves show what columns and measures are being displayed. Even this highly simplified report shows valuable information about each of our machines, the types of failures they experience and because we used measures it is all interactive. For example, if we click on "Worn Component" in the middle bar chart:

We can see that failures caused by worn components occur much less often than the average (a MTBF of 300 versus 199 for all failure types) and they take about half the time to complete (MDT of 1.23 versus 2.19 for all failure types).

Conclusion

Other than one potentially complex DAX calculated column, it is extremely easy and straight-forward to use Power BI to create reports that summarize repair information, including MTBF. And, remember that the definition of "failure" can be subjective and can be creatively applied to other subject domains. This example here really is just based on having a start time and an end time for "something". So, could this be applied to something like HR data and start dates and termination dates? Sure, consider each termination and "failure" and MTBF could shed light on the average time it takes to backfill a position perhaps. The possibilities are endless so start putting this technique into use in your own situations in Power BI!

Greg_Deckler · ‎01-19-2018

Added tracking counter:

onlinetimesheet · ‎02-14-2018

I think this is a real great post. really looking forward to read more. Great.

JasonM · ‎04-09-2018

This was really great. Thank you for posting.

I have a use case for this but I am running into an issue because for a given 'machine' I might have more than one record of repairs.

RepairStart, RepairEnd, Category, Location, Machine
1/1/2018 2:40AM, 1/1/2018 4:15 AM, OS Crash, Seattle, Server 1
1/1/2018 3:12AM, 1/1/2018 2:15 AM, OS Crash, Seattle, Server 1

So a repair started after the first, but was resolved before the second.

Messed around with a few things, but am still getting a lot of negative values for MTBF and 'Uptime'.
I tried making a new row key by combing Category, Location, Machine. But its not gotten me what I am after, so I would appreciate some input if anyone sees this and might understand whats going on a bit better than I do.

Cheers!

Anonymous · ‎12-18-2018

Good morning, I am following all the steps of the instructions and I get the error "The syntax of 'IF' is incorrect."
Could you tell me what I'm wrong about?

Regards! Thank you.
Javier.-

rajendraongole1 · ‎09-10-2019

Nice article. thanks for the Guidance and Hands On.

sTaLa · ‎12-13-2019

I had the same issue as

What did the trick for me was to change the "," with ";" in the formula. Maybe that could help other people.

Great article by the way thank you!

jmfos · ‎08-14-2020

Really good post, you helped me a lot.

Thank you!

Anonymous · ‎09-28-2020

This is awesome! only one question, I added an extra column, "Department" so I could sort by department. I think i need to add something to the syntax, my table did not populate when I did this. I can remove the Column and the table will populate. How do I fix this?

Thanks,

AdamPSP · ‎01-05-2021

If there were multiple machines that weren’t always on at once but rather on a schedule, could this repair tracking be expanded to track only the machines “on” for their schedule period: weekly or bi-weekly schedules? If I know when a machines is “on” from a schedule (date and time of start and stop table) could this be applied so that the various measures be only applied to the "on “computers?

Would this also work for monitoring a call-log for when employees are on a call vs not on a call during a schedule work period with predetermined schedule?

Greg_Deckler · ‎01-07-2021

@AdamPSP I do not see anything specific as to why this would not work in the scenarios you listed. However, I would need to understand your data better to determine exactly how you would adapt this approach.

AdamPSP · ‎01-09-2021

Watch swapping.png

@Greg_Deckler Using your example in the post I was able to apply it to tracking time duration of specific tasks of personnel and the time duration of when personnel are not performing a task (this first part works really well for what I'm trying to measure). The challenge I have is the personnel are on a watch periods so not always on like in your Machine example. I want to only measure between a period of time: OnWatch and OffWatch for each person (Machine). I have a spreadsheet of date & time for when persons (MachineName) starts and finishes a watch cycle (OnWatch, OffWatch).

I would like to have this table be part of the time period that measures are taken. Otherwise, the measures are scewed because the measures are calculating a time period the person (machine) is not actaully OnWatch. How would I go about implementing a table (as described above) for the measuring to take into account?

Greg_Deckler · ‎01-11-2021

@AdamPSP I could see a couple different approaches to this, you could filter your data table to only the rows where the machine is "onwatch" and then do your calculation. Or, you could do it in the column and as part of the calculation check to see if it falls into an "onwatch" period.

AdamPSP · ‎01-11-2021

@Greg_Deckler Thank you for getting back to this question. I am new to Power BI and how the various filters would work for this example so please excuse me if I'm missing something seemingly simple.

Ultimately, I want to measure using your “uptime” calculation for only the period of time a person or persons (machines) are “On Watch”. Since the persons (machines) have various OnWatch and OffWatch times, I think this calculation has to be done using a table on the actual vs a set scheduled date & time.

In the below image, each machine would have different start date & time for “OnWatch & OffWatch” periods. During this On Watch period of time (approximately 15 days), there would be uptime & failures for each machine. I would only want to measure the available (On Watch) machines for their uptime and failures.

Sergio_Reese · ‎05-19-2021

Hi @Greg_Deckler , I have a question regarding the use of this formula. With the data that I am using I have different machines and different equipments that correspond to each machine, this formula works great when filtering the column of Machine and Equipment, but it does not work when filterring only the machine, for example if I want to see times between the failures of all equipments in machine one.

I Want to know if its possible to make this type of filtering to get this data because What I am trying to do is just change the label of "Machine and Equipment" to just "Machine" but it is not working, I get a lot of zeros and wrong times.

I add this table as an example and I hope that I was able to explain myself with my question

Thank you for your time

Anonymous · ‎05-31-2021

hello @Greg_Deckler

Your post is really helpful, It would be good. If you share with us the pbix file of the report. I would appreciate your help. Thanks.