Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.
Dear all,
I'm using the following R code in power query, and it's working well to provide me the studentized residuals that I use to flag outliers in a big dataset :
As output, I get the 'df' object containing the col "group" + the augment output including the studentized residuals column.
library(tidyverse)
library(broom)
dataset <- as.data.frame(dataset)
dataset$perf <- as.numeric(dataset$perf)
dataset$factor1 <- as.factor(dataset$factor1)
dataset$factor2 <- as.factor(dataset$factor2)
df <- dataset %>%
group_by(group) %>%
mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%
filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%
do(cbind(group = .$group, lm(perf ~ factor1 + factor2, data = .) %>% augment))
My objective is obtaining the original dataset + the studentized column to avoid having 2 big tables (my original table has 20M of rows, not easy to have 2 tables with such big size ...). Thanks a lot for your help !
Hi @Betty888
You can refer to the following link, it may can help you.
Best Regards!
Yolo Zhu
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi !
Many thanks for the link you shared, it was very useful !
Now I have mu final code giving me exactly what I need :
library(tidyverse)
library(broom)
dataset <- as.data.frame(dataset)
dataset$perf <- as.numeric(dataset$perf)
dataset$factor1 <- as.factor(dataset$factor1)
dataset$factor2 <- as.factor(dataset$factor2)
dataset <- dataset %>%
mutate(row=row_number())
df <- dataset %>%
group_by(group) %>%
mutate(unique_factor1 = n_distinct(factor1), unique_factor1 = n_distinct(factor1), var = var(perf)) %>%
filter( unique_factor1 != 1 & unique_factor2 != 1 & var != 0 ) %>%
do(cbind(row = .$row, lm(perf ~ factor1 + factor2, data = .) %>% augment))
df <- left_join(dataset, df, by = "row")
With the huge size of my table, do you think there is a way to optimize this code?
Many thanks again 🙏
Hi @Betty888
Maybe you can use data.table instead of data.frame.
You can refer to the following links. it may help you.
r - alternate of cbind and rbind in loop - Stack Overflow
r - how to filter by group keys on an already grouped dataframe (grouped_df) - Stack Overflow
Best Regards!
Yolo Zhu
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.