Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
bigsky5829
Frequent Visitor

need help with percentile.exc filtering to specific rows

I have one table that has the following information:

 

table1:

District, Week of Year, Employee, Tenure weeks

 

My goal is to determine, for each district and week combination, the first quintile for employee tenure excluding employees less than 25 weeks.

 

I used power query to create table2, with table1 as the source and the following transforms:

  • remove rows where Tenure weeks is less than 26
  • Group by District and Week of Year, count rows
  • remove counts <11

     

    I then tried the following DAX queries

 

 

firstquintile = CALCULATE(PERCENTILE.EXC(table1[Tenure weeks],0.2),FILTER(table1,table1[Tenure weeks] > 26 && table1[Cost Center District] = table2[Cost Center District] && table1[Week of Year] = table2[Week of Year]))

 

This returns an error "percentile value should be in the range 1/(n+1)...n/(n+1)", which would imply that there are cases where n is less than four.  But I have forced all ns to be at least 11.  Futhermore, drilling down to specific District and Week of Year combinations that produced errors, prove the n to be over 11.  I also noted by adding and removing filters (in power query), I would get inconsistent results... sometimes getting errors and sometimes not for identical filters.  It was enough to make me feel like I was being gaslit.**

 

Attempt #2:

 

 

firstquintile2 = PERCENTILE.EXC(SELECTCOLUMNS(FILTER(table1,table1[Tenure weeks] > 26 && table1[Cost Center District] = table2[Cost Center District] && table1[Week of Year] = table2[Week of Year]),"tenure",[Tenure weeks]),0.2)

 

The problem with this query is SELECTCOLUMNS returns a table and PERCENTILE.EXC requires a column

 

Attempt #3:

 

 

firstquintile2 = PERCENTILE.EXC(FILTER(table1,tixlochrsmaster[Tenure weeks] > 26 && table1[Cost Center District] = table1[Cost Center District] && table1[Week of Year] = table2[Week of Year])[Tenure weeks],0.2)

 

This returns a syntax error.  It appears referring to the columnname after the filter function is not a valid syntax.

 

I'm at a complete loss how to do this.

 

**Gas lighting examples:

Filtered to these two districts returns values:

bigsky5829_0-1618673277145.png

This district returns values:

bigsky5829_2-1618673324262.png

But this combination returns an error?????

bigsky5829_3-1618673353444.png

 

8 REPLIES 8
bigsky5829
Frequent Visitor

@daxer-almighty 

 

Thanks for the suggestions and the resources.  I spent the last few hours sanitizing the data and recreating my efforts on the sanitized data.  I'm not looking to win any beauty awards, just hoping to learn something and get on with solving a wider problem.

 

Here is the data:

https://www.dropbox.com/s/moz7t5sle54y4o3/sanitizeddata.csv?dl=0

 

I'm going to split the rest of this post into 3 sections:

  1. My efforts in Power Query Editor in Excel
  2. My efforts in Power Query Editor/DAX in Power BI
  3. A crappy workaround that I have that raises more questions than answers.

 

Part 1:

 

I first did this in Power Query only:

 

let
    Source = Csv.Document(File.Contents("C:\Users\TristenHannah\Dropbox (USICLLC)\USIC\Capacity Model\cap model sanitized data\sanitizeddata.csv"),[Delimiter=",", Columns=13, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Cost Center Supervisor Group", type text}, {"Employee", type text}, {"Employee ID", Int64.Type}, {"Cost Center District", type text}, {"Week Ending Date", type datetime}, {"Miles Walked", Int64.Type}, {"Worked Hours", Int64.Type}, {"Nails Hammered", Int64.Type}, {"Employee Job Title", type text}, {"Employee Original Hire Date", type datetime}, {"Tenure weeks", Int64.Type}, {"Week of Year", Int64.Type}, {"NPH", type number}})
in
    #"Changed Type"

 

let
    Source = sanitizeddata,
    #"Grouped Rows" = Table.Group(Source, {"Cost Center District", "Week of Year"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    #"Merged Queries" = Table.NestedJoin(#"Grouped Rows", {"Cost Center District", "Week of Year"}, sanitizeddata, {"Cost Center District", "Week of Year"}, "joineddata", JoinKind.LeftOuter),
    #"Removed Columns" = Table.RemoveColumns(#"Merged Queries",{"Count"}),
    addcolperc1 = Table.AddColumn(#"Removed Columns", "percentile1", each List.Percentile(Table.SelectRows([joineddata], (IT) => IT[#"Tenure weeks"] > 26)[#"Tenure weeks"],0.2)),
    addcolperc2 = Table.AddColumn(addcolperc1, "percentile2", each List.Percentile(Table.SelectRows([joineddata], (IT) => IT[#"Tenure weeks"] > 26)[#"Tenure weeks"],0.4)),
    addcolperc3 = Table.AddColumn(addcolperc2, "percentile3", each List.Percentile(Table.SelectRows([joineddata], (IT) => IT[#"Tenure weeks"] > 26)[#"Tenure weeks"],0.6)),
    addcolperc4 = Table.AddColumn(addcolperc3, "percentile4", each List.Percentile(Table.SelectRows([joineddata], (IT) => IT[#"Tenure weeks"] > 26)[#"Tenure weeks"],0.8)),
    maxvalue = Table.AddColumn(addcolperc4, "maximumvalue", each List.Max(Table.SelectRows([joineddata], (IT) => IT[#"Tenure weeks"] > 26)[#"Tenure weeks"]))
in
    maxvalue

 

Pros: it works!

Cons: it may not be done running when I wake up in the morning.

 

I really need this to run in a few minutes to be viable.

 

Part 2:

 

Loaded Data into PQ:

let
    Source = Csv.Document(File.Contents("C:\Users\TristenHannah\Dropbox (USICLLC)\USIC\Capacity Model\cap model sanitized data\sanitizeddata.csv"),[Delimiter=",", Columns=13, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Cost Center Supervisor Group", type text}, {"Employee", type text}, {"Employee ID", Int64.Type}, {"Cost Center District", type text}, {"Week Ending Date", type datetime}, {"Miles Walked", Int64.Type}, {"Worked Hours", Int64.Type}, {"Nails Hammered", Int64.Type}, {"Employee Job Title", type text}, {"Employee Original Hire Date", type datetime}, {"Tenure weeks", type number}, {"Week of Year", type number}, {"NPH", type number}})
in
    #"Changed Type"

 

Created second table to group into the correct dimensions I want the quintiles.  Prior to grouping, removed all records with tenure less than 26 weeks.  Since I want the flexibility to calculate down to deciles, I removed all records with a count 10 or less (which will throw an error if I try to calculate less than 1/(N+1).

let
    Source = sanitizeddata,
    #"Filtered Rows1" = Table.SelectRows(Source, each [#"Tenure weeks"] > 26),
    #"Grouped Rows" = Table.Group(#"Filtered Rows1", {"Cost Center District", "Week of Year"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each [Count] > 10),
    #"Sorted Rows" = Table.Sort(#"Filtered Rows",{{"Cost Center District", Order.Ascending}, {"Week of Year", Order.Ascending}})
in
    #"Sorted Rows"

 

This calculated column works:

firstquintilecount = CALCULATE(COUNTROWS(sanitizeddata),FILTER(sanitizeddata,sanitizeddata[Tenure weeks] > 26 && sanitizeddata[Cost Center District] = tenurepercentiles[Cost Center District] && sanitizeddata[Week of Year] = tenurepercentiles[Week of Year]))

 

So does this one:

sumtest = CALCULATE(SUM(sanitizeddata[Tenure weeks]),FILTER(sanitizeddata,sanitizeddata[Tenure weeks] > 26 && sanitizeddata[Cost Center District] = tenurepercentiles[Cost Center District] && sanitizeddata[Week of Year] = tenurepercentiles[Week of Year]))

 

Does not work:

firstquintile = CALCULATE(PERCENTILE.EXC(sanitizeddata[Tenure weeks],0.2),FILTER(sanitizeddata,sanitizeddata[Tenure weeks] > 26 && sanitizeddata[Cost Center District] = tenurepercentiles[Cost Center District] && sanitizeddata[Week of Year] = tenurepercentiles[Week of Year]))

Error: the percentile value should be in the range 1/(N+1)...N/(N+1) inclusive, where N is the number of data values.

 

The error seems impossible, because I have forced N to be at least 11, so 0.2 is in the range.

 

Part 3:

 

I was actually able to get the formula to run, and it actually runs very fast.  Here's how.   For the second table above, use this code:

let
    Source = sanitizeddata,
    #"Filtered Rows1" = Table.SelectRows(Source, each [#"Tenure weeks"] > 26),
    #"Grouped Rows" = Table.Group(#"Filtered Rows1", {"Cost Center District", "Week of Year"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each [Count] > 10),
    #"Sorted Rows" = Table.Sort(#"Filtered Rows",{{"Cost Center District", Order.Ascending}, {"Week of Year", Order.Ascending}}),
    #"Filtered Rows2" = Table.SelectRows(#"Sorted Rows", each ([Cost Center District] <> "C1" and [Cost Center District] <> "C10" and [Cost Center District] <> "C11" and [Cost Center District] <> "C13" and [Cost Center District] <> "C14" and [Cost Center District] <> "C15" and [Cost Center District] <> "C16" and [Cost Center District] <> "C17" and [Cost Center District] <> "C18" and [Cost Center District] <> "C2" and [Cost Center District] <> "C3" and [Cost Center District] <> "C4" and [Cost Center District] <> "C5" and [Cost Center District] <> "C6" and [Cost Center District] <> "C7" and [Cost Center District] <> "C8" and [Cost Center District] <> "C9"))
in
    #"Filtered Rows2"

This is identical to the second bit of code in part 2, except with the additional filter at the end that removes all C districts.  What's special about C districts?  For one, I don't need them in the model.  They are part of other business units that don't benefit from this model.  The fact they are even in the source data is a bit suspect, since they don't produce activity that would normally result in this data.

 

For example, filtering to C11, Tenure weeks > 26, Week of Year = 1

bigsky5829_0-1618716014569.png

I know for a fact these employees are not part of C11, and you can tell by the data it looks like garbage numbers.  But... the Tenure weeks DOES offer enough to calculate quintiles, so I have no idea why it would fail when C11 is included.  We've forced all District/Week combinations to have at least 11 good datapoints.

 

Conclusion

So the Power Query only version works, but at several hours to run it isn't practical at all.  And the DAX version I can make work, but it fails in certain circumstances I don't understand and it makes me worried about relying on that as a solution.

 

I'm definitely open to any suggestions!  I will read through the references you provided and continue my youtube education.

daxer-almighty
Solution Sage
Solution Sage

But... if you really insist on the WRONG way to do it, then you'll have to show us the model. No way around it.

daxer-almighty
Solution Sage
Solution Sage

By the way... to add to my last post, if your data is in the region of millions of rows, then calculating something row by row in DAX will be sloooooooow out of necessity unless, of course, you know how to write optimized DAX. Again, as I said, this all should be calculated in PQ for at least 3 reasons: 1) It's the RIGHT way to do it. 2) It'll be maintainable. 3) If done properly, it'll be extremely fast.

daxer-almighty
Solution Sage
Solution Sage

@bigsky5829 

 

First, it does not look to be a query. Looks to be a measure or a calculated column (which are very different things).

 

Second, if there's no relationship between the tables, then a naked reference to a column in the tables will surely return an error; even if there is a relationship, it's not just referring to a column that will work - you have to wrap the reference into something like RELATED(TABLE) or SELECTEDVALUE.

 

Measures need aggregations. You can't just do this:

table1[Cost Center District] = table2[Cost Center District] && table1[Week of Year] = table2[Week of Year]

 and expect it to work. DAX does not work this way.

 

On top of that, I'm not even sure you are correctly removing the employees with less than 26 tenure years but I'm going to leave it.

 

I would strongly and kindly suggest you read this before doing any serious PBI work: Understand star schema and the importance for Power BI - Power BI | Microsoft Docs And when you're done with this, it would also be advisable that you read the other articles. If you invest into reading the docs, this will be time very well spent. I know what I'm saying.

Thanks for the response.  I read the link you sent, and I think I understand generically, though not every single detail.

 

I am using a formula for a calculated column.  From what I read, an aggregator is not necessary for calculated columns?  If I use the first formula I came up with, and I replaced PERCENTILE.EXC with SUM, it runs fine.  I spot check some rows and the calculation is perfect.  So it feels like I'm pretty close.

 

If may help if I describe a little more what I'm trying to do.  Ultimate I'm trying to build a model in Excel for the finance team to use--they are not power bi or power query users.  The data is in the millions of rows, and there are some pretty intensive calculations that need to be made.  Ultimately I'm trying to import the data in CSVs, run the calculations and group the data into much smaller tables designed for the model, and export those into excel.  I ultimately will build 4 or 5 tables of a few thousands rows a piece and then use excel for the rest of the model.  So I'm not trying to build a fully functional model and star schema in Power BI and make great visualizations or anything... I'm just trying to process data faster than I can in the other tools I've tried.

 

Forgive my ignorance, but I'm not a programmer, I'm just working on this in my spare time to help the team.  With a lot of help from my tutors Google and Youtube, I wrote this in power query but it took too long to load.  I can almost certainly figure this out in TSQL, but I'm trying DAX in Power BI first.

Yeah, I understand that. But if you only need to create tables and then load them into Excel... well, then DAX is NOT the tool to use. Power Query is. And PQ can be blazingly fast but one has to know how to program for performance in it. Not everything that "works" should be put to practice, obviously. If you really want to create the tables, then you'll have to learn how to properly use PQ and M (the functional language it uses). I learned a lot about PQ from this:

 

GuestPost: Newbie to Newbie Learning M-Language as your first Programming Language – The BIccountant

 

and

 

Power Query M Primer (part 1):Introduction, Simple Expressions & let | Ben Gribaudo

 

The last one is especially worth reading.

The second link is very helpful, thank you!

@bigsky5829 

 

OK. When I have time I might take a look at this but please don't expect me to give you a quick solution overnight.

 

But I'll give you a good piece of advice. Go and find the blog that is owned by Chris Webb. He's a specialist in everything BI, especially cubes, and has tonnes of experience. He's got brililiant articles on performance in Power Query. You definitely should start with his thoughts on the performance of the M language.

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors