Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Grow your Fabric skills and prepare for the DP-600 certification exam by completing the latest Microsoft Fabric challenge.

Reply
Betty888
Helper II
Helper II

How to add lm.augment output to my original dataset

Dear all,

I'm using the following R code in power query, and it's working well to provide me the studentized residuals that I use to flag outliers in a big dataset :

As output, I get the 'df' object containing the col "group" + the augment output including the studentized residuals column.

 

    library(tidyverse)   

    library(broom)    

    dataset <- as.data.frame(dataset)   

    dataset$perf <- as.numeric(dataset$perf)   

    dataset$factor1 <- as.factor(dataset$factor1)   

    dataset$factor2 <- as.factor(dataset$factor2)    

    df <- dataset %>%   

             group_by(group) %>%     

             mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%     

             filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%   

             do(cbind(group = .$group, lm(perf ~ factor1 + factor2, data = .) %>% augment))

 

My objective is obtaining the original dataset + the studentized column to avoid having 2 big tables (my original table has 20M of rows, not easy to have 2 tables with such big size ...). Thanks a lot for your help !

3 REPLIES 3
v-xinruzhu-msft
Community Support
Community Support

Hi @Betty888 

You can refer to the following link, it may can help you.

r - Creating models and augmenting data without losing additional columns in dplyr/broom - Stack Ove...

 

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Hi !

Many thanks for the link you shared, it was very useful ! 

Now I have mu final code giving me exactly what I need : 

 library(tidyverse)   

 library(broom)    

 dataset <- as.data.frame(dataset)   

 dataset$perf <- as.numeric(dataset$perf)   

 dataset$factor1 <- as.factor(dataset$factor1)   

 dataset$factor2 <- as.factor(dataset$factor2)

dataset <- dataset %>%
mutate(row=row_number())     

    df <- dataset %>%   

      group_by(group) %>%     

     mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%     

      filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%   

      do(cbind(row = .$row, lm(perf ~ factor1 + factor2, data = .) %>% augment))

 

df <- left_join(dataset, df, by = "row")

 

With the huge size of my table, do you think there is a way to optimize this code?

Many thanks again 🙏

 

Hi @Betty888 

Maybe you can use data.table instead of data.frame.

You can refer to the following links. it may help you.

r - alternate of cbind and rbind in loop - Stack Overflow

r - how to filter by group keys on an already grouped dataframe (grouped_df) - Stack Overflow

 

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

 

Helpful resources

Announcements
RTI Forums Carousel3

New forum boards available in Real-Time Intelligence.

Ask questions in Eventhouse and KQL, Eventstream, and Reflex.

MayPowerBICarousel1

Power BI Monthly Update - May 2024

Check out the May 2024 Power BI update to learn about new features.

Top Solution Authors
Top Kudoed Authors