Use R for cluster analysis in Power BI

v-lili6-msft · ‎02-25-2021

Definition：

Cluster analysis is a data reduction technique that aims to reveal a subset of observations in a data set. An important use of cluster analysis is to divide the target groups with multiple indicators. Such kind of classification of target groups is the basis and core of refined management and personalized operation. Only when the correct classification is carried out can the individual be effectively differentiated including refined operations, services and product support.

Scenario:

Sometimes we will encounter requirements not only to aggregate and analyze data, but also to divide data groups into more meaningful categories based on given indicators so that the analysis can be refined. Now you can try the hierarchical cluster analysis method introduced in this article.

Sample data:

The following table lists the eight main variables of the average annual consumption expenditure per capita of urban households in 31 provinces, municipalities directly under the central government and autonomous regions across the country in 1999.

Preparation:

Install R.

We can download and install R for free from the Microsoft R Application Network or the CRAN Repository. (Latest version: R-4.0.3 for Windows (32/64 bit))

Install needed packages in R.

Launch R application.
Secure one CRAN mirror and choose needed packages. Or you can paste the code below to install packages and select one CRAN mirror closest.
install.packages("NbClust")

Enable R visuals in Power BI Desktop.

For the details, please refer to this document: Create Power BI visuals using R.

Please check the restrictions and limitations mentioned in this official document.

Operations:

Import the sample data into Power BI Desktop.
Create an R visual.

Put all columns needed into “Values” field of the R visual.

Firstly, let's observe the table of sample data.

# display the dataset

library(gridExtra) # gridExtra package provides a number of user-level functions to work with "grid" graphics, notably to arrange multiple grid-based plots on a page, and draw tables.

grid.table(dataset) # display dataset as a table

Note: Duplicated rows will be removed from the data. If don’t want this, you can add a column with no duplicate values, such as an Index column.

We can look at the dendrogram results obtained by using different clustering methods.

#Reimport the data and set the value of the first column as a row variable

a<-read.table("your file address",sep=",",header=T,row.names=1)

#Standardize variables

a.scaled <- scale(a)

#Calculate instance distances

d <- dist(a.scaled)

hc1<-hclust(d, "complete") #Complete Linkage

hc2<-hclust(d, "average") #Average Linkage

hc3<-hclust(d, "centroid") #Centroid

hc4<-hclust(d, "ward.D2") #Ward

par(mfrow=c(2,2)) #Form a 2x2 picture matrix

plot(hc1, hang=-1, cex=.8, main="Average Linkage Clustering") #Complete Linkage

plot(hc2, hang=-1, cex=.8, main="Average Linkage Clustering") #Average Linkage

plot(hc3, hang=-1, cex=.8, main="Average Linkage Clustering") #Centroid

plot(hc4, hang=-1, cex=.8, main="Average Linkage Clustering") #Ward

The Ward method is more sensitive to outliers and tends to aggregate classes with a small number of observations, and it is easier to generate classes with roughly the same number of observations. From the above four results, the Ward method better meets the actual clustering requirements. Therefore, we choose Ward method for the next cluster analysis.

Select the number of clusters

We can choose to divide the 31 regions into 3 categories.

We can achieve the following clustering results.

#Divide the tree diagram into three categories

clusters4 <- cutree(hc4, k=3)

#Get the median of each category

aggregate(a, by=list(cluster=clusters4), median)

#Get the median of each category after standardization

aggregate(as.data.frame(a.scaled), by=list(cluster=clusters4),

median)

#Get the dendrogram

plot(hc4, hang=-1, cex=.8,

main="Average Linkage Clustering\n3 Cluster Solution")

#Overlay three types of solutions

rect.hclust(hc4, k=3)

Result analysis

We can display the classified areas on the map to obtain a more intuitive result.

Obviously, in 1999, Beijing, Zhejiang, Shanghai, and Guangdong, which belonged to Category I, were the regions with the most developed economy and the highest consumption level of urban residents in China. Areas belonging to the second category, such as Tianjin, Jiangsu, and Chongqing, were basically areas with a medium level of economic development and urban residents' consumption level in China. Areas belonging to the third category, such as Shanxi and Gansu, were basically underdeveloped areas in China, and the consumption level of urban residents was also low.

Summary:

R has advantages in analyzing data categories and can generate specific classifications. When we are unable to classify data instances and analyze the relationship in Power BI, please try to create R visuals. The cooperation effect between R and Power BI will be unexpectedly great.

Author: Yuna

Reviewers: Zoey

Use R for cluster analysis in Power BI

How to provide sample data in the Power BI Forum

Re: Power BI Tip: How To Sort A Column With Duplic...

Enhanced Data Profiling in Power Query: GUI and Ta...

Creating a “custom” or “hybrid” matrix in PowerBI

Supercharge Your Visuals: Easy Conditional Formatt...