Re: Find & count certain for each column

jk91 · ‎08-14-2018

Hi all,

My first post here, did some searching but could not find anything that resembles what I'd like to have.

I am working on data that has some shortcomings: many cells are "Unknown", "?", "??" or NULL, 0, Empty. Now, I'd like to count how often such values occur per column. I have created something that kinda does what I want. I have copied the existing data source and then used R to do this:

# 'dataset' holds the input data for this script
output2 <- dataset
library(tidyverse)

unknown_values = c("Unknown", "UNKNOWN", "???", "??", "?")

isUnknown <- function(x){
  x %in% unknown_values
}

empty_values = c("0", 0, " ", "", NULL, "NULL")

isEmpty <- function(x){
  x %in% empty_values
}

output2 = output2 %>%
filter(Status == "Production") %>%
select(-one_of(c("OSStd", "OSSPStd"))) %>%  # replace to your needs
gather() %>%
group_by(key) %>%
summarise_all(
  funs(
    'Unknown' = sum(isUnknown(.)), 
    'Empty' = sum(isEmpty(.))
  )
) %>%
as.data.frame

This solution works but I would also like to filter using the dataframe that my R code spits out which unfortunately does not work out of the box. Makes sense however, because I have two dataframes that are not connected.

An example: a column in the original dataframe (1) is AssetType and another column in datasource (marked in blue).

AssetType

Server

Desktop

0

Empty

Null

Workstation

...

My R code creates a row in a secondary dataframe (2, marked in yellow) as follows:

key -- Unknown -- Empty

AssetType -- 201 -- 16

Now what I would like to happen, is that the table marked in blue only counts the rows where assetType is empty or unknown, which comes down to approx. 8.5% of 2639 = 222 in total. I could do this by making two measures (one for unknown, another one for empty) for each column that I have in dataframe 1, although I am not sure how I would have to select columns in a similar way as shown in the yellow part of the image above.

Is there a better way to do this or a way to link the two dataframes so that the filters will be applied preferably in a bidirectional way?

Kind regards,

Jacco

Stachu · ‎08-14-2018

I find it a bit difficult to understand what are you trying to achieve, can you post sample few rows from the both tables (anonymised)

Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

jk91 · ‎08-14-2018

Dataframe 1Dataframe 2

Df1 contains rows with info about systems. Df2 contains rows that in essence are counts of the columns from Df1.

Regards

Stachu · ‎08-14-2018

ok, so you want to count rows in Dataframe1, that meet certain criteria

in DAX this can be achieved like this

AssetType_Empty= CALCULATE(COUNTROWS(DF1),DF1[AssetType] = BLANK())
AssetType_Unknown= CALCULATE(COUNTROWS(DF1),DF1[AssetType] = "Unknown")

assuming Dataframe2 is supposed to only serve as a source for the dropdown in the yellow visual, the following syntax would work

Empty = 
VAR ColumnName = SELECTEDVALUE(DF2[key])
RETURN
SWITCH(ColumnName,
    "AssetType",[AssetType_Empty],
    "Other column",[Other column_Empty],
    BLANK()
)

then very similiar measure for Unknown
Is this what you were looking for, or do you look for a way to modify values in DF2?

Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

jk91 · ‎08-17-2018

Hi Stachu,

Thanks for your help. Using your solution, I'd have to manually create two measures (x_empty & x_unknown) for each column x I guess? And for an overall accuracy / integrity measure i'd have to combine those measures into a single score. That could work although it would be quite a hassle with 35 rows hehe.

Kind regards

Stachu · ‎08-17-2018

you could create just 2 measures, and include the column specific ones as variables

Empty = 
VAR ColumnName = SELECTEDVALUE(DF2[key])
VAR AssetType_Empty = CALCULATE(COUNTROWS(DF1),DF1[AssetType] = BLANK())
VAR Other_column_Empty = CALCULATE(COUNTROWS(DF1),DF1[Other_column] = BLANK())
RETURN
SWITCH(ColumnName,
    "AssetType",AssetType_Empty,
    "Other column",Other_column_Empty,
    BLANK()
)

if you could unpivot the columns so they would all be in 2 columns like ColumnName & ColumnValue then it's just 2 very simple measures. This approach requires some drastic changes to your data model

Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

jk91 · ‎08-17-2018

Thanks again, I am not sure whether I have made myself as clear as I would like so I'll try to rephrase. What I need is a score for data quality for both the whole dataframe(df1) and every single column in there. My R code simply counts the occurence of empty or unknown per column and saves those two counts in a second dataframe. However, the two df's have different structures and therefore they do not support cross-filtering.

What I'd like to have is that I can select a column (e.g. AssetType) using a slicer and that when I apply a filter (e.g. AVClient = "McAfee") the metrics update and only shows the unknown and empty counts for each row where AVClient equals McAfee.

However, if I deselect the AssetType column, it should show a score for the whole dataframe where AVClient = McAfee. I guess this requires looping over every cell in the df or calculating two scores for each column beforehand.

Stachu · ‎08-17-2018

what you're describing in regards to filters is a default behaviour, and assuming I understand you correctly this is how measure works now
so e.g.
you filter AssetType out of DF2 column Key to show Empty in AssetTpe column
whatever kind of filter you apply on DF1 (e.g. slicer DF1[AVClient] = McAfee) will affect the result of the Measure I posted

Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

Find & count certain values for each column

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly