Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
Betty888
Helper II
Helper II

How to add lm.augment output to my original dataset

Dear all,

I'm using the following R code in power query, and it's working well to provide me the studentized residuals that I use to flag outliers in a big dataset :

As output, I get the 'df' object containing the col "group" + the augment output including the studentized residuals column.

 

    library(tidyverse)   

    library(broom)    

    dataset <- as.data.frame(dataset)   

    dataset$perf <- as.numeric(dataset$perf)   

    dataset$factor1 <- as.factor(dataset$factor1)   

    dataset$factor2 <- as.factor(dataset$factor2)    

    df <- dataset %>%   

             group_by(group) %>%     

             mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%     

             filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%   

             do(cbind(group = .$group, lm(perf ~ factor1 + factor2, data = .) %>% augment))

 

My objective is obtaining the original dataset + the studentized column to avoid having 2 big tables (my original table has 20M of rows, not easy to have 2 tables with such big size ...). Thanks a lot for your help !

3 REPLIES 3
v-xinruzhu-msft
Community Support
Community Support

Hi @Betty888 

You can refer to the following link, it may can help you.

r - Creating models and augmenting data without losing additional columns in dplyr/broom - Stack Ove...

 

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Hi !

Many thanks for the link you shared, it was very useful ! 

Now I have mu final code giving me exactly what I need : 

 library(tidyverse)   

 library(broom)    

 dataset <- as.data.frame(dataset)   

 dataset$perf <- as.numeric(dataset$perf)   

 dataset$factor1 <- as.factor(dataset$factor1)   

 dataset$factor2 <- as.factor(dataset$factor2)

dataset <- dataset %>%
mutate(row=row_number())     

    df <- dataset %>%   

      group_by(group) %>%     

     mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%     

      filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%   

      do(cbind(row = .$row, lm(perf ~ factor1 + factor2, data = .) %>% augment))

 

df <- left_join(dataset, df, by = "row")

 

With the huge size of my table, do you think there is a way to optimize this code?

Many thanks again 🙏

 

Hi @Betty888 

Maybe you can use data.table instead of data.frame.

You can refer to the following links. it may help you.

r - alternate of cbind and rbind in loop - Stack Overflow

r - how to filter by group keys on an already grouped dataframe (grouped_df) - Stack Overflow

 

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

 

Helpful resources

Announcements
LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors