Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
jk91
Frequent Visitor

Find & count certain values for each column

Hi all,

 

My first post here, did some searching but could not find anything that resembles what I'd like to have.

 

I am working on data that has some shortcomings: many cells are "Unknown", "?", "??" or NULL, 0, Empty. Now, I'd like to count how often such values occur per column. I have created something that kinda does what I want. I have copied the existing data source and then used R to do this:

 

# 'dataset' holds the input data for this script
output2 <- dataset
library(tidyverse)

unknown_values = c("Unknown", "UNKNOWN", "???", "??", "?")

isUnknown <- function(x){
  x %in% unknown_values
}

empty_values = c("0", 0, " ", "", NULL, "NULL")

isEmpty <- function(x){
  x %in% empty_values
}

output2 = output2 %>%
filter(Status == "Production") %>%
select(-one_of(c("OSStd", "OSSPStd"))) %>%  # replace to your needs
gather() %>%
group_by(key) %>%
summarise_all(
  funs(
    'Unknown' = sum(isUnknown(.)), 
    'Empty' = sum(isEmpty(.))
  )
) %>%
as.data.frame

 

This solution works but I would also like to filter using the dataframe that my R code spits out which unfortunately does not work out of the box. Makes sense however, because I have two dataframes that are not connected.

 

An example: a column in the original dataframe (1) is AssetType and another column in datasource (marked in blue). 

 

AssetType

 

Server

Desktop

0

Empty

Null

Workstation

...

 

My R code creates a row in a secondary dataframe (2, marked in yellow) as follows:

 

key -- Unknown -- Empty

AssetType -- 201 -- 16

 

PBI.PNG

 

Now what I would like to happen, is that the table marked in blue only counts the rows where assetType is empty or unknown, which comes down to approx. 8.5% of 2639 = 222 in total. I could do this by making two measures (one for unknown, another one for empty) for each column that I have in dataframe 1, although I am not sure how I would have to select columns in a similar way as shown in the yellow part of the image above.

 

Is there a better way to do this or a way to link the two dataframes so that the filters will be applied preferably in a bidirectional way?

 

Kind regards,

 

Jacco

7 REPLIES 7
Stachu
Community Champion
Community Champion

I find it a bit difficult to understand what are you trying to achieve, can you post sample few rows from the both tables (anonymised)



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

jk91
Frequent Visitor

Dataframe 1Dataframe 1Dataframe 2Dataframe 2

Df1 contains rows with info about systems. Df2 contains rows that in essence are counts of the columns from Df1.

 

Regards

Stachu
Community Champion
Community Champion

ok, so you want to count rows in Dataframe1, that meet certain criteria

in DAX this can be achieved like this

AssetType_Empty= CALCULATE(COUNTROWS(DF1),DF1[AssetType] = BLANK())
AssetType_Unknown= CALCULATE(COUNTROWS(DF1),DF1[AssetType] = "Unknown")

assuming Dataframe2 is supposed to only serve as a source for the dropdown in the yellow visual, the following syntax would work

Empty = 
VAR ColumnName = SELECTEDVALUE(DF2[key])
RETURN
SWITCH(ColumnName,
    "AssetType",[AssetType_Empty],
    "Other column",[Other column_Empty],
    BLANK()
)

then very similiar measure for Unknown
Is this what you were looking for, or do you look for a way to modify values in DF2?

 

 



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

jk91
Frequent Visitor

Hi Stachu,

 

Thanks for your help. Using your solution, I'd have to manually create two measures (x_empty & x_unknown) for each column x I guess? And for an overall accuracy / integrity measure i'd have to combine those measures into a single score. That could work although it would be quite a hassle with 35 rows hehe. 

 

Kind regards

 

 

Stachu
Community Champion
Community Champion

you could create just 2 measures, and include the column specific ones as variables

Empty = 
VAR ColumnName = SELECTEDVALUE(DF2[key])
VAR AssetType_Empty = CALCULATE(COUNTROWS(DF1),DF1[AssetType] = BLANK())
VAR Other_column_Empty = CALCULATE(COUNTROWS(DF1),DF1[Other_column] = BLANK())
RETURN
SWITCH(ColumnName,
    "AssetType",AssetType_Empty,
    "Other column",Other_column_Empty,
    BLANK()
)

if you could unpivot the columns so they would all be in 2 columns like ColumnName & ColumnValue then it's just 2 very simple measures. This approach requires some drastic changes to your data model



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

jk91
Frequent Visitor

Thanks again, I am not sure whether I have made myself as clear as I would like so I'll try to rephrase. What I need is a score for data quality for both the whole dataframe(df1) and every single column in there. My R code simply counts the occurence of empty or unknown per column and saves those two counts in a second dataframe. However, the two df's have different structures and therefore they do not support cross-filtering.

 

What I'd like to have is that I can select a column (e.g. AssetType) using a slicer and that when I apply a filter (e.g. AVClient = "McAfee") the metrics update and only shows the unknown and empty counts for each row where AVClient equals McAfee.

 

However, if I deselect the AssetType column, it should show a score for the whole dataframe where AVClient = McAfee. I guess this requires looping over every cell in the df or calculating two scores for each column beforehand.

 

Stachu
Community Champion
Community Champion

what you're describing in regards to filters is a default behaviour, and assuming I understand you correctly this is how measure works now
so e.g.
you filter AssetType out of DF2 column Key to show Empty in AssetTpe column
whatever kind of filter you apply on DF1 (e.g. slicer DF1[AVClient] = McAfee) will affect the result of the Measure I posted



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors