Solved: regmatches woes in R/Query Editor

Anonymous · ‎01-15-2019

Hi all

I'm trying to do some pattern matching in the query editor. The intention being that I can then merge a reformatted postcode to an enormous table, which I do not want to upload in it's entirety to PBI online.

I have got validation working fine with grepl, but I need to ensure a common 8 character format. But I am getting an error in the column, the R scri[pt executes, but the column returns "error" on all rows.

This is where I am:

# 'dataset' holds the input data for this script
pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"
pattern2 <- "[0-9][A-Za-z]{2}"
matches <- function(x) {gregexpr(pattern2 , as.character(x), perl=TRUE)}
part2 <- regmatches(dataset$post_code, matches(dataset$post_code))
PCDS <- within(dataset, {PCDS = part2})

I think the problem is that grepexpr returns a multidimensional array. I need the 1st index to go in the column "PCDS" but can't get the syntax right.

Please help! Thanks in advance!

Anonymous · ‎01-23-2019

Fixed it! The regex could be more comprehensive and it's very step by step, but this R script converts UK postcodes matching the regex pattern to an 8 character format in the query editor.

# 'dataset' holds the input data for this script
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("\\s+|\\s+", "", x)
ws <- function(x){ 
    if(x>0){strrep(" ",x)}
    else {""}
}

pattern <- "^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) ?[0-9][A-Za-z]{2}"
match <- function(x) {gregexpr(pattern , as.character(x), perl=TRUE)}
part <- function(x) {c(regmatches(x, match(x)))}

pCodeList <- part(dataset$post_code)
pCodeVector <- as.character(pCodeList)

for(i in 1:length(pCodeVector))
{
    pCodeVector[i] <- trim(pCodeVector[i])
    pCodeLength <- nchar(pCodeVector[i])
    pCodeRight <- toupper(substr(pCodeVector[i], pCodeLength - 2, pCodeLength))
    pCodeLeft <- toupper(substr(pCodeVector[i],1, pCodeLength -3))
    pCodeVector[i] <- paste(pCodeLeft, ws((8 - nchar(pCodeLeft)) - nchar(pCodeRight)), pCodeRight, sep='')
}

PCDS <- within(dataset, {PCDS = (pCodeVector)})

Development discussion warmly welcomed.

Beth

x

View solution in original post

v-yuta-msft · ‎01-15-2019

@Anonymous,

Don't know to much about R script and the error message you have encountered with, however, you may modify the last statement with below:

dataset$PCDS <- within(dataset, {dataset$PCDS = part2})

Community Support Team _ Jimmy Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Anonymous · ‎01-22-2019

Unfortunately putting "dataset$" in either of those positions returns an empty table which then causes an error in the query editor. 😞

Anonymous · ‎01-22-2019

I'm still stuck with:

# 'dataset' holds the input data for this script
pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"
pattern2 <- "[0-9][A-Za-z]{2}"
match1 <- function(x) {gregexpr(pattern1 , as.character(x), perl=TRUE)}
part1 <- function(x) {c(regmatches(x, match1(x)))}

PCDS <- within(dataset, {PCDS = part1(post_code)})

After playing around in the R shell:

> post_code<-c("A12 3BC","A123BC","A12    3BC","a12 3bc","1234")

> pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"

> match1 <- function(x) {gregexpr(pattern1 , as.character(x), perl=TRUE)}

> part1 <- function(x) {c(regmatches(x, match1(x)))}

> part1(post_code)
[[1]]
[1] "A12"

[[2]]
[1] "A12"

[[3]]
[1] "A12"

[[4]]
[1] "a12"

[[5]]
character(0)

> post_code
[1] "A12 3BC"    "A123BC"     "A12    3BC" "a12 3bc"    "1234"

As you can see the data has a very different arrangement between post_code and part1(postcode). So I'm stilthinking down the lines that the vector just needs reorganising.

Problem is, I learned R last week. This didn't work!

output1<-vector(mode="character", length = 0)
for(val in part1){
    if(length(val)==0){
        output1<=c(part1,"Not UK Postcode")
    }
    else{
        output1<-val
    }
}

So, if you know anything about R vector manipulationm, please help!

xx

Anonymous · ‎01-23-2019

Fixed it! The regex could be more comprehensive and it's very step by step, but this R script converts UK postcodes matching the regex pattern to an 8 character format in the query editor.

# 'dataset' holds the input data for this script
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("\\s+|\\s+", "", x)
ws <- function(x){ 
    if(x>0){strrep(" ",x)}
    else {""}
}

pattern <- "^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) ?[0-9][A-Za-z]{2}"
match <- function(x) {gregexpr(pattern , as.character(x), perl=TRUE)}
part <- function(x) {c(regmatches(x, match(x)))}

pCodeList <- part(dataset$post_code)
pCodeVector <- as.character(pCodeList)

for(i in 1:length(pCodeVector))
{
    pCodeVector[i] <- trim(pCodeVector[i])
    pCodeLength <- nchar(pCodeVector[i])
    pCodeRight <- toupper(substr(pCodeVector[i], pCodeLength - 2, pCodeLength))
    pCodeLeft <- toupper(substr(pCodeVector[i],1, pCodeLength -3))
    pCodeVector[i] <- paste(pCodeLeft, ws((8 - nchar(pCodeLeft)) - nchar(pCodeRight)), pCodeRight, sep='')
}

PCDS <- within(dataset, {PCDS = (pCodeVector)})

Development discussion warmly welcomed.

Beth

x

regmatches woes in R/Query Editor

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly