Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
diagnosed
Frequent Visitor

Create a measure for the average of a set of values excluding a set percentage

I feel this is an incredibly simple question, but I can't get my data set to work quite right so I'm turning to the forum.

 

Let's say that I have a set of data where I'm looking to calculate the time it takes a helpdesk agent to resolve a ticket, but our system occasionally produces outliers that I want to filter for. So, we have a data set which looks something like this (in minutes):

 

10

15

13

20

23

17

11

9

190

11

 

Where we have 90% of the data set which is valid, and anticipate 10% of the data set are outliers we want to filter for. If we average for the all-up for the set, we get 31.9; if we do a percentile for 0.9, we get 173; when the value we want to return (sans the upper 10% of the data set, here the 190) is 14.3.

Hopefully this is a quick and easy for somebody to answer, but it'd be very appreciated.

1 ACCEPTED SOLUTION
jdbuchanan71
Super User
Super User

Hello @diagnosed 

Give something like this a try where you get the max time then filter out the top 10%:

Avg limited =
VAR outlier =
    MAX ( Tickets[Time] )
RETURN
    CALCULATE ( AVERAGE ( Tickets[Time] ), Tickets[Time] < outlier * .9 )

If you are wanting to take of only the top 10% of the entire data set it will be.

Avg limited 2 =
VAR outlier =
    CALCULATE ( MAX ( Tickets[Time] ), ALL ( Tickets ) )
RETURN
    CALCULATE ( AVERAGE ( Tickets[Time] ), Tickets[Time] < outlier * .9 )

 

View solution in original post

7 REPLIES 7
diagnosed
Frequent Visitor

For my needs, the less stringent definition and just grabbing outside of the upper 10% was fine, but this is all absolutely tremendous information and incredibly useful to me as I build out my data set - thank you both so much for this help.

Anonymous
Not applicable

First, what's an outlier?

 

An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe 1999). Usually, the presence of an outlier indicates some sort of problem. This can be a case which does not fit the model under study, or an error in measurement.

 

A convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

 

For your needs, it's the upper bound that counts. You have to calculate it and then filter the data in question for non-outliers. Then average out. The definition you've been given by @jdbuchanan71 is not correct by any statistical means.

 

Best

Darek

Interesting @Anonymous .  So it would be more like this from a statistical standpoint?

Avg limited 3 = 
VAR FirstMedian = CALCULATE(MEDIAN(Tickets[Time]),ALL(Tickets))
VAR LowMedian = CALCULATE(MEDIAN(Tickets[Time]),ALL(Tickets),Tickets[Time]<=FirstMedian)
VAR HighMedian = CALCULATE(MEDIAN(Tickets[Time]),ALL(Tickets),Tickets[Time]>=FirstMedian)
RETURN CALCULATE(AVERAGE(Tickets[Time]),Tickets[Time] < HighMedian * 1.5)

I posted the sample file I was messing with for you to take a look at.

https://www.dropbox.com/s/5u9cca24iwy074p/Tickets%20Median.pbix?dl=0

 medianoutliers.jpg

Anonymous
Not applicable

Hi there. Well, not really. It just so happens that I'm a mathematician as well (to give you a bit more credentials). So, let's see:

 

"... is a point which falls more than 1.5 times the interquartile range above the third quartile [...]"

 

What's the interquartile range? Well, it's IQR = (Q3 - Q1)/2. In your terms, it would be

 

IQR = (HighMedian - LowMedian) / 2.

 

Then, the upper boundary would be:

 

UB = HighMedian + 1.5 * IQR.


This is the upper boundary above which an observation should be treated as a possible outlier.

 

Your formula requires just a slight change but an IMPORTANT ONE.

 

Best

Darek

Thanks @Anonymous for the clarification.  So we want.

Avg Limited 3 = 
VAR FirstMedian = CALCULATE(MEDIAN(Tickets[Time]),ALL(Tickets))
VAR LowMedian = CALCULATE(MEDIAN(Tickets[Time]),ALL(Tickets),Tickets[Time]<=FirstMedian)
VAR HighMedian = CALCULATE(MEDIAN(Tickets[Time]),ALL(Tickets),Tickets[Time]>=FirstMedian)
VAR IQR = DIVIDE ( HighMedian - LowMedian, 2 )
VAR UB = HighMedian + 1.5 * IQR

RETURN CALCULATE( AVERAGE( Tickets[Time] ), Tickets[Time] < UB )

Which shifts my UB from 121.5 down to 118.5

outliersubcalc.jpg

Anonymous
Not applicable

Yep, that’s correct.
jdbuchanan71
Super User
Super User

Hello @diagnosed 

Give something like this a try where you get the max time then filter out the top 10%:

Avg limited =
VAR outlier =
    MAX ( Tickets[Time] )
RETURN
    CALCULATE ( AVERAGE ( Tickets[Time] ), Tickets[Time] < outlier * .9 )

If you are wanting to take of only the top 10% of the entire data set it will be.

Avg limited 2 =
VAR outlier =
    CALCULATE ( MAX ( Tickets[Time] ), ALL ( Tickets ) )
RETURN
    CALCULATE ( AVERAGE ( Tickets[Time] ), Tickets[Time] < outlier * .9 )

 

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.