cancel
Showing results for 
Search instead for 
Did you mean: 
Greg_Deckler

Mean Time Between Failure (MTBF) and Power BI

Introduction

Mean Time Between Failure (MTBF) is a common term and concept used in equipment and plant maintenance contexts. In addition, MTBF is an important consideration in the development of products. MTBF, along with other maintenance, repair and reliability information, can be extremely valuable to organizations to help identify problematic systems, predict system outages, improve product designs and improve overall operational efficiency and effectiveness. As such, it would be beneficial to understand how to implement this metric within Power BI.

Background

MTBF is essentially the arithmetic mean time between failures of a system. Now, the definition of a failure will depend greatly on the circumstance in which this metric is applied but in general, failures that do not take the system out of service are generally not considered failures. In addition, routine maintenance tasks that are scheduled are also generally not considered failures.

The below image from this Wikipedia Page probably does about as good a job as anything in depicting MTBF.

 

mtbf5.png

 

In formal math language, MTBF is defined as:

 

mtbf2.png

 

Let's Build It!

OK, enough history and background, let's build this puppy!

 

Step 1: Prepare Your Data

So, I generated some fake data that you can download here. The data has a little over 4,000 rows and represents 3 years of repair data for about a dozen "machines". This data is pretty bare bones but has the essential elements that we will need to demonstrate the technique. Each row represents a single repair. Here is what the data looks like:

 

mtbf3.png

 

So, we essentially have an identifier for the machine, when the repair started, when the repair ended, the type of repair (a repair or preventative maintenance (PM)) and the cause of the repair.

 

Step 2: Load into Power BI and Create Calculated Fields

This next step is easy, just fire up Power BI Desktop and from the Home tab of the ribbon, choose Get Data | Excel. Point to the data file you just downloaded (above) and choose either the Sheet (MTBF) or the Table (MTBF1) displayed in the navigation and choose Load.

 

Next, we are going to create two calculated columns. The first one is how long the repair took to complete. The formula for this is:

Repair Hours = DATEDIFF(Repairs[RepairStarted], Repairs[RepairCompleted],SECOND)/3600

Once you create this column, on the Modeling tab of the ribbon, make sure that the Data type is set to Decimal Number and that you specify at least 2 decimal places.

 

Obviously, all we are doing is subtracting the time the repair started from the time the repair completed so that we end up with how long the repair took to complete. Perhaps less obvious is the reason we specified SECOND instead of HOUR in our DATEDIFF function even though we want this metric in hours and thus the division by 3600 seconds/hour. The reason is that if we had specified HOUR, then DATEDIFF would have truncated the calculation to whole hours and we would have lost fractions of an hour.

 

The second calculated column is a bit more complex:

 

Uptime =

VAR next = MINX(FILTER(Repairs,

                       Repairs[MachineName]=EARLIER(Repairs[MachineName]) &&

                       Repairs[RepairStarted]>EARLIER(Repairs[RepairStarted]) &&

                       Repairs[RepairType]<>"PM"

               ),Repairs[RepairStarted])

RETURN IF([RepairType]="PM", 0,IF(ISBLANK(next),

               DATEDIFF([RepairCompleted],NOW(),SECOND),

               DATEDIFF([RepairCompleted],next,SECOND)

           )

       )

 

OK, let me explain what is going on here. Remember back to our definition of MTBF, we need to find the uptime between a machine going into an "up" state and that same machine going into a "down" state. But our data isn't structured that way, the data is focused on collecting repair information.

 

Thus, the first part of the formula, the VAR portion is the calculation of the variable next. This is about finding the next repair after the current repair. We do this by finding the MIN of the RepairStarted column after filtering our Repair table for machines that are equal to the current machine in the row as well as having a RepairStarted that is after the current repair in the row and we are excluding "PM" rows as those are preventative maintenance and not an actual failure.

 

The second part of the formula, the RETURN portion is really three pieces. If the current row is a maintenance task (PM), then we return 0 for uptime as we do not want those tasks included in our MTBF calculation. If our calculation for next is BLANK, then we know that this is the most recent failure in our data set. So, we return the difference in seconds between NOW and the completion of the current failure. Finally, if neither of those are the case, then we simply calculate the difference in seconds between when our next repair started and the current repair ended.

 

Step 3: Create some Measures

OK, now that we have our data loaded and have created some extra calculated columns, we can now create the measures that we will use in our report. Create the following measures:

  1. Repairs = CALCULATE(COUNTROWS(Repairs),FILTER(Repairs,[RepairType]<>"PM"))
  2. MTBF (Hours) = DIVIDE(SUM(Repairs[Uptime]),[Repairs],BLANK())/3600
  3. MDT (Hours) = SUM(Repairs[Repair Hours])/COUNTROWS(Repairs)
  4. Last Repair = MAX([RepairCompleted])
  5. Next Expected Repair = [Last Repair] + [MTBF (Hours)]/24

Repairs: This is the total number of repairs that have occurred excluding preventative maintenance (PM) repairs.

 

MTBF: This is our MTBF calculation that takes the sum of our uptime in seconds, divides by the number of repairs to get an average and then converts the number to hours by dividing by 3600 seconds/hour.

 

MDT: This is a measure that is often associated with MTBF, MDT is Mean Down Time or the average amount of time that a repair takes to complete. Therefore, we take the sum of our Repair Hours (which does not exclude preventative maintenance tasks) and divides it by the count of all repairs. Note, we do not use the Repairs measure in this case since we want to include PM repairs as well.

 

Last Repair: This is simply the date/time of our last repair

 

Next Expected Repair: By knowing when our last repair occurred, we can add our MTBF calculation converted to days (/24) to determine when we expect our next failure to occur.

 

Step 4: Build the Report

Now that we have all our measures created, it is mere child's play to create a report similar to the following:

 

mtbf1.png

 

I have left everything default so that the visuals themselves show what columns and measures are being displayed. Even this highly simplified report shows valuable information about each of our machines, the types of failures they experience and because we used measures it is all interactive. For example, if we click on "Worn Component" in the middle bar chart:

 

mtbf4.png

 

We can see that failures caused by worn components occur much less often than the average (a MTBF of 300 versus 199 for all failure types) and they take about half the time to complete (MDT of 1.23 versus 2.19 for all failure types).

 

Conclusion

Other than one potentially complex DAX calculated column, it is extremely easy and straight-forward to use Power BI to create reports that summarize repair information, including MTBF. And, remember that the definition of "failure" can be subjective and can be creatively applied to other subject domains. This example here really is just based on having a start time and an end time for "something". So, could this be applied to something like HR data and start dates and termination dates? Sure, consider each termination and "failure" and MTBF could shed light on the average time it takes to backfill a position perhaps. The possibilities are endless so start putting this technique into use in your own situations in Power BI!

Comments

Added tracking counter:

 

This was really great. Thank you for posting. 

I have a use case for this but I am running into an issue because for a given 'machine' I might have more than one record of repairs.

RepairStart, RepairEnd, Category, Location, Machine
1/1/2018 2:40AM, 1/1/2018 4:15 AM, OS Crash, Seattle, Server 1
1/1/2018 3:12AM, 1/1/2018 2:15 AM, OS Crash, Seattle, Server 1


So a repair started after the first, but was resolved before the second.  

Messed around with a few things, but am still getting a lot of negative values for MTBF and 'Uptime'.
I tried making a new row key by combing Category, Location, Machine. But its not gotten me what I am after, so I would appreciate some input if anyone sees this and might understand whats going on a bit better than I do. 

Cheers!




Anonymous

Good morning, I am following all the steps of the instructions and I get the error "The syntax of 'IF' is incorrect."
Could you tell me what I'm wrong about?

pregunta.JPG

 

 


Regards! Thank you.
Javier.-

If there were multiple machines that weren’t always on at once but rather on a schedule, could this repair tracking be expanded to track only the machines “on” for their schedule period: weekly or bi-weekly schedules? If I know when a machines is “on” from a schedule (date and time of start and stop table) could this  be applied so that the various measures be only applied to the "on “computers?

 

Would this also work for monitoring a call-log for when employees are on a call vs not on a call during a schedule work period with predetermined schedule?  

@AdamPSP I do not see anything specific as to why this would not work in the scenarios you listed. However, I would need to understand your data better to determine exactly how you would adapt this approach.

Watch swapping.png

@Greg_Deckler Using your example in the post I was able to apply it to tracking time duration of specific tasks of personnel and the time duration of when personnel are not performing a task (this first part works really well for what I'm trying to measure). The challenge I have is the personnel are on a watch periods so not always on like in your Machine example. I want to only measure between a period of time: OnWatch and OffWatch for each person (Machine). I have a spreadsheet of date & time for when persons (MachineName) starts and finishes a watch cycle (OnWatch, OffWatch). 

 

I would like to have this table be part of the time period that measures are taken. Otherwise, the measures are scewed because the measures are calculating a time period the person (machine) is not actaully OnWatch. How would I go about implementing a table (as described above) for the measuring to take into account?

@AdamPSP I could see a couple different approaches to this, you could filter your data table to only the rows where the machine is "onwatch" and then do your calculation. Or, you could do it in the column and as part of the calculation check to see if it falls into an "onwatch" period.

@Greg_Deckler Thank you for getting back to this question. I am new to Power BI and how the various filters would work for this example so please excuse me if I'm missing something seemingly simple. 

 

Ultimately, I want to measure using your “uptime” calculation for only the period of time a person or persons (machines) are “On Watch”. Since the persons (machines) have various OnWatch and OffWatch times, I think this calculation has to be done using a table on the actual vs a set scheduled date & time.

 

In the below image, each machine would have different start date & time for “OnWatch & OffWatch” periods. During this On Watch period of time (approximately 15 days), there would be uptime & failures for each machine. I would only want to measure the available (On Watch) machines for their uptime and failures.

 

OnWatch.png

 

 

Polls
What is your favorite Power BI Feature release this month?