Merging

ankitpatira · ‎06-15-2016

Purpose of this blog post is to demonstrate how to use R script to load and prepare data source in Power BI desktop. R is open source powerful statistical analysis and visualisation language. Integrating R in Power BI lets you undertake complex data manipulation tasks. In this post I am going to use two sample csv dataset that contains raw data for import and export amounts for various Australian territories, and using R transform it to right format to be visualised in Power BI desktop. I am only going to cover basics of R to get data into the shape required so that you can use more interesting functions of R in future.

Prerequisites

For this post I am assuming you have basic understanding of R and its concepts such as data frame, packages (though we will not require any special package to be installed for this tutorial) and its syntax. Almost everything that you can think of can be done with functions in R and for this tutorial we are only going to work with base R functions.

Setup

Download and install latest version of R for windows from https://cran.r-project.org/bin/windows/base/.

Once R is installed in powerbi desktop ensure R data source is pointed to the correct location. If you have selected default location during R installation no change is required however if you’ve used custom location then you need to point R data source in powerbi desktop accordingly. To update that in powerbi desktop go to File > Options and Settings > Options > R Scripting > Detected R home directories > Other and browse to the location where R is installed.

Import

To import data into powerbi using R, go to Get Data > Other > R Script > Connect which will give you dialog box to enter in R scripts. Files I've used for this post is attached below.

Copy and paste below code that will import contents of csv file into a table called input and see in preview navigator. Using ‘~//’ will look for csv file in the default folder of R installation which is in Documents folder however you can specify custom path using double backslashes for example C:\\Users\\abc\\Desktop\\AU_Import.csv. It is also good practise to explicitly mention separator and you can use comma or tab delimiter as required.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

Formatting & Sorting

Majority of times not all the data available is useful and you may only need the small subset of data from the actual file. Here we are only interested in columns Frequency, Commodity, Time and Value. Also changing column names to more meaningful ones.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Destination","Frequency","Commodity.by.SITC","Time","Value")
input <- input[,keepcolumns]
colnames(input)[3] <- c("Commodity")
colnames(input)[1] <- c("State") 
colnames(input)[5] <- c("ImportValue")

If you notice Time column has values in 'mmm-yyyy’ format which will be recognized as characters by R and text based field in power bi and therefore won’t allow time based slicing and dicing. To convert values into proper date format we will make use of several R functions. Using Paste, Substr functions to extract month name out of values in Time column, convert it to month number, append ‘01’ to it and paste that to last 5 characters of Time column resulting in Date column in ‘dd-mm-yyyy’ format. Time column is then not required and can be removed.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Destination","Frequency","Commodity.by.SITC","Time","Value")
input <- input[,keepcolumns]
colnames(input)[3] <- c("Commodity")
colnames(input)[1] <- c("State") 
colnames(input)[5] <- c("ImportValue")

#extract monthname
input$Date <- paste(substr(input$Time,nchar(input$Time)-3,nchar(input$Time)), paste(match(substr(input$Time, 1,3), month.abb), "01", sep="-"), sep="-")
#Time column no longer required
attach(input)
input$Time <- NULL

We will then use Date column to sort using Order function. By default sorting is ascending.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Destination","Frequency","Commodity.by.SITC","Time","Value")
input <- input[,keepcolumns]
colnames(input)[3] <- c("Commodity")
colnames(input)[1] <- c("State")
colnames(input)[5] <- c("ImportValue")

#extract monthname
input$Date <- paste(substr(input$Time,nchar(input$Time)-3,nchar(input$Time)), paste(match(substr(input$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(input)
input$Time <- NULL

#change data type of Date column to date
input$Date <- as.Date(input$Date)

#sort by date column
input[order(Date),]

Filtering

As we’re only interested in state level data, using subset function we can filter out rows that is not useful. Using ‘!=’ operator and ‘&’ operator to specify multiple conditions we are going to excluded rows where there is no state details, commodity totals and state totals. Using Omit function will remove all rows where value is null.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Destination","Frequency","Commodity.by.SITC","Time","Value")
input <- input[,keepcolumns]
colnames(input)[3] <- c("Commodity")
colnames(input)[1] <- c("State") 
colnames(input)[5] <- c("ImportValue")

#extract monthname
input$Date <- paste(substr(input$Time,nchar(input$Time)-3,nchar(input$Time)), paste(match(substr(input$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(input)
input$Time <- NULL

#change data type of Date column to date
input$Date <- as.Date(input$Date)

#sort by date column
input[order(Date),]

#Use subset function to exclude rows
input <- subset(input, Commodity!="TOTAL" & State!="No state details" & State!="Total")

#Use omit function to remove null rows
input <- na.omit(input)

Similarly we will prepare export data into a data frame called output using above code.

#use read.csv method to import data
output <- read.csv("~//AU_Export.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Origin","Frequency","Commodity.by.SITC","Time","Value")
output <- output[,keepcolumns]
colnames(output)[3] <- c("Commodity")
colnames(output)[1] <- c("State") 
colnames(output)[5] <- c("ExportValue")

#extract monthname
output$Date <- paste(substr(output$Time,nchar(output$Time)-3,nchar(output$Time)), paste(match(substr(output$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(output)
output$Time <- NULL

#change data type of Date column to date
output$Date <- as.Date(output$Date)

#sort by date column
output[order(Date),]

#Use subset function to exclude rows
output <- subset(output, Commodity!="TOTAL" & State!="No state details" & State!="Total")

#Use omit function to remove null rows
output <- na.omit(output)

Merging

Finally we want to merge these two data frames into single one to be used in power bi. We will use merge function for that and specify columns we want to perform match on.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Destination","Frequency","Commodity.by.SITC","Time","Value")
input <- input[,keepcolumns]
colnames(input)[3] <- c("Commodity")
colnames(input)[1] <- c("State") 
colnames(input)[5] <- c("ImportValue")
#extract monthname
input$Date <- paste(substr(input$Time,nchar(input$Time)-3,nchar(input$Time)), paste(match(substr(input$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(input)
input$Time <- NULL

#change data type of Date column to date
input$Date <- as.Date(input$Date)

#sort by date column
input[order(Date),]

#Use subset function to exclude rows
input <- subset(input, Commodity!="TOTAL" & State!="No state details" & State!="Total")

#Use omit function to remove null rows
input <- na.omit(input)
 

#use read.csv method to import Export data
output <- read.csv("~//AU_Export.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Origin","Frequency","Commodity.by.SITC","Time","Value")
output <- output[,keepcolumns]
colnames(output)[3] <- c("Commodity")
colnames(output)[1] <- c("State") 
colnames(output)[5] <- c("ExportValue")
#extract monthname
output$Date <- paste(substr(output$Time,nchar(output$Time)-3,nchar(output$Time)), paste(match(substr(output$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(output)
output$Time <- NULL

#change data type of Date column to date
output$Date <- as.Date(output$Date)

#sort by date column
output[order(Date),]

#Use subset function to exclude rows
output <- subset(output, Commodity!="TOTAL" & State!="No state details" & State!="Total")

#Use omit function to remove null rows
output <- na.omit(output) 


#Use merge function to merge two dataframes
Combined <- merge(input,output,by=c("State","Frequency","Commodity","Date"))

Transforming

To calculate percentage values for each row in Import and Export columns, using basic arithmetic operation we will divide each value by sum of values in that column and then format it to display with percentage sign using Paste function.

#use read.csv method to import data
input <- read.csv("~//AU_Import.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Destination","Frequency","Commodity.by.SITC","Time","Value")
input <- input[,keepcolumns]
colnames(input)[3] <- c("Commodity")
colnames(input)[1] <- c("State") 
colnames(input)[5] <- c("ImportValue")
#extract monthname
input$Date <- paste(substr(input$Time,nchar(input$Time)-3,nchar(input$Time)), paste(match(substr(input$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(input)
input$Time <- NULL

#change data type of Date column to date
input$Date <- as.Date(input$Date)

#sort by date column
input[order(Date),]

#Use subset function to exclude rows
input <- subset(input, Commodity!="TOTAL" & State!="No state details" & State!="Total")

#Use omit function to remove null rows
input <- na.omit(input)
 

#use read.csv method to import Export data
output <- read.csv("~//AU_Export.csv", stringsAsFactors=FALSE, sep=",")

#excluding unwanted columns and change column names
keepcolumns <- c("State.of.Origin","Frequency","Commodity.by.SITC","Time","Value")
output <- output[,keepcolumns]
colnames(output)[3] <- c("Commodity")
colnames(output)[1] <- c("State") 
colnames(output)[5] <- c("ExportValue")
#extract monthname
output$Date <- paste(substr(output$Time,nchar(output$Time)-3,nchar(output$Time)), paste(match(substr(output$Time, 1,3), month.abb), "01", sep="-"), sep="-")

#Time column no longer required
attach(output)
output$Time <- NULL

#change data type of Date column to date
output$Date <- as.Date(output$Date)

#sort by date column
output[order(Date),]

#Use subset function to exclude rows
output <- subset(output, Commodity!="TOTAL" & State!="No state details" & State!="Total")

#Use omit function to remove null rows
output <- na.omit(output) 


#Use merge function to merge two datasets
Combined <- merge(input,output,by=c("State","Frequency","Commodity","Date"))


#New columns calculating percentage values for each Import and Export			
Combined$ImportValuePct <- Combined$ImportValue / sum(Combined$ImportValue)
Combined$ExportValuePct <- Combined$ExportValue / sum(Combined$ExportValue)

#Format columns to show values in percentage
Combined$ImportValuePct <- paste(round(Combined$ImportValuePct*100,digits=1),"%",sep="")
Combined$ExportValuePct <- paste(round(Combined$ExportValuePct*100,digits=1),"%",sep="")

As a result of above code three tables will be created in power bi however we will only be interested in table Combined which is the outcome of above transformations.

Jessicajz_92 · ‎08-25-2016

I have encounter this problem when I tried to add in the Mice library using the code given in one of the post. R Script Error.JPG

Not sure what has gone wrong, it will be good if you can advice me.

Thanks

BR

Jessica

andrewbrick · ‎01-12-2018

You have a serious problem here: you allow users to select multiple dataframes from a single R script, but you don't handle that well during refresh.

When I "get data" from an R script and paste code into the dialog box, Power BI executes the script and displays whatever dataframes are present when the code is completed. You allow users to select multiple dataframes from this list for use in their Power BI report. However, when a refresh is conducted, Power BI executes this single R script as many times as the number of dataframes you selected to include in the report. This can cause highly unexpected behavior when the R script is referencing files on drives that may be changed as a result of the script.

If I select multiple dataframes from a single R script I input into Power BI, Power BI ought to be smart enough to know that when I click "Refresh," that it only needs to run the script once and pull resulting data into the two (or more) tables I originally selected when I "got data."

Anonymous · ‎01-11-2019

Hi @andrewbrick

I am not sure to well understand your comment:

are you saying that for the moment, power BI does not tackle well the refresh when having > 1 dataset uploaded through a R script? (Cause it runs R scripts as many datasets?).

andrewbrick · ‎01-11-2019

@Anonymous,

Yes, I think. If I set an R script as a data source and select three dataframes from that script, Power BI creates three separate datasets, each with its own copy of the same R script. When a refresh is triggered, Power BI will execute these three scripts (which are the same script) in parallel.

I don't think this is a huge problem unless write operations occur in the script (e.g. write.csv(...)). But even still, it's unjustifiably inefficient for Power BI to run the script three times when it only needs to run it once. Power BI engineers should create a way for Power BI to keep track of those instances when multiple datasets come from a single R script. In those cases, a refresh should trigger a single execution of the script, and Power BI should feed the resulting data into those multiple datasets.

Anonymous · ‎01-14-2019

@andrewbrick I get your point,

I am generating two datasets with my R script and I think there is no problems in terms of running twice my script but it is useless.

How can we escalate this to make it happen (run only once the R script)? Let's make it happen (except if there is a specific reason why it is done like that so far and there should be!)

Anonymous · ‎05-13-2019

@andrewbrick - also when you need to make a change to said R script - you need to change it for all 3 table sources. In the Power Query Editor - in the Applied Steps, the first step is Source - it would need to be updated here for all 3 of the tables that you are bringing in (to keep the code consistent). There should be way to "update the source" code of the R script in one location - such as the Data Source Settings.

kcwilsonii · ‎09-23-2021

How do you edit the R script after loading the data to make changes?

I have a script loading data but I have to add table from another database

on the same server.

If you perform "get data" and then run the same script you the same tables again but with a "2" after it.

now I have duplicate tables

R script as data source in Power BI Desktop

Merging

Unveiling the Power of Lakehouse's SQL Analytics E...

How to provide sample data in the Power BI Forum

Re: Power BI Tip: How To Sort A Column With Duplic...

Important Links every Power BI enthusiastic shoul...

Enhanced Data Profiling in Power Query: GUI and Ta...