How to add lm.augment output to my original datase...

Betty888 · ‎10-31-2023

Dear all,

I'm using the following R code in power query, and it's working well to provide me the studentized residuals that I use to flag outliers in a big dataset :

As output, I get the 'df' object containing the col "group" + the augment output including the studentized residuals column.

library(tidyverse)

library(broom)

dataset <- as.data.frame(dataset)

dataset$perf <- as.numeric(dataset$perf)

dataset$factor1 <- as.factor(dataset$factor1)

dataset$factor2 <- as.factor(dataset$factor2)

df <- dataset %>%

group_by(group) %>%

mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%

filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%

do(cbind(group = .$group, lm(perf ~ factor1 + factor2, data = .) %>% augment))

My objective is obtaining the original dataset + the studentized column to avoid having 2 big tables (my original table has 20M of rows, not easy to have 2 tables with such big size ...). Thanks a lot for your help !

v-xinruzhu-msft · ‎11-01-2023

Hi @Betty888

You can refer to the following link, it may can help you.

r - Creating models and augmenting data without losing additional columns in dplyr/broom - Stack Ove...

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Betty888 · ‎11-02-2023

Hi !

Many thanks for the link you shared, it was very useful !

Now I have mu final code giving me exactly what I need :

library(tidyverse)

library(broom)

dataset <- as.data.frame(dataset)

dataset$perf <- as.numeric(dataset$perf)

dataset$factor1 <- as.factor(dataset$factor1)

dataset$factor2 <- as.factor(dataset$factor2)

dataset <- dataset %>%
mutate(row=row_number())

df <- dataset %>%

group_by(group) %>%

mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%

filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%

do(cbind(row = .$row, lm(perf ~ factor1 + factor2, data = .) %>% augment))

df <- left_join(dataset, df, by = "row")

With the huge size of my table, do you think there is a way to optimize this code?

Many thanks again 🙏

v-xinruzhu-msft · ‎11-03-2023

Hi @Betty888

Maybe you can use data.table instead of data.frame.

You can refer to the following links. it may help you.

r - alternate of cbind and rbind in loop - Stack Overflow

r - how to filter by group keys on an already grouped dataframe (grouped_df) - Stack Overflow

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

How to add lm.augment output to my original dataset

Helpful resources

Fabric certifications survey

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024