Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Anonymous
Not applicable

regmatches woes in R/Query Editor

Hi all

 

I'm trying to do some pattern matching in the query editor. The intention being that I can then merge a reformatted postcode to an enormous table, which I do not want to upload in it's entirety to PBI online.

 

I have got validation working fine with grepl, but I need to ensure a common 8 character format. But I am getting an error in the column, the R scri[pt executes, but the column returns "error" on all rows.

 

This is where I am:

 

# 'dataset' holds the input data for this script
pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"
pattern2 <- "[0-9][A-Za-z]{2}"
matches <- function(x) {gregexpr(pattern2 , as.character(x), perl=TRUE)}
part2 <- regmatches(dataset$post_code, matches(dataset$post_code))
PCDS <- within(dataset, {PCDS = part2})


I think the problem is that grepexpr returns a multidimensional array. I need the 1st index to go in the column "PCDS" but can't get the syntax right.

 

Please help! Thanks in advance!

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Fixed it! The regex could be more comprehensive and it's very step by step, but this R script converts UK postcodes matching the regex pattern to an 8 character format in the query editor.

# 'dataset' holds the input data for this script
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("\\s+|\\s+", "", x)
ws <- function(x){ 
    if(x>0){strrep(" ",x)}
    else {""}
}

pattern <- "^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) ?[0-9][A-Za-z]{2}"
match <- function(x) {gregexpr(pattern , as.character(x), perl=TRUE)}
part <- function(x) {c(regmatches(x, match(x)))}

pCodeList <- part(dataset$post_code)
pCodeVector <- as.character(pCodeList)

for(i in 1:length(pCodeVector))
{
    pCodeVector[i] <- trim(pCodeVector[i])
    pCodeLength <- nchar(pCodeVector[i])
    pCodeRight <- toupper(substr(pCodeVector[i], pCodeLength - 2, pCodeLength))
    pCodeLeft <- toupper(substr(pCodeVector[i],1, pCodeLength -3))
    pCodeVector[i] <- paste(pCodeLeft, ws((8 - nchar(pCodeLeft)) - nchar(pCodeRight)), pCodeRight, sep='')
}

PCDS <- within(dataset, {PCDS = (pCodeVector)})

Development discussion warmly welcomed.

 

Beth

x

View solution in original post

4 REPLIES 4
v-yuta-msft
Community Support
Community Support

@Anonymous,

 

Don't know to much about R script and the error message you have encountered with,  however, you may modify the last statement with below:

dataset$PCDS <- within(dataset, {dataset$PCDS = part2})

 

Community Support Team _ Jimmy Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly. 

Anonymous
Not applicable

Unfortunately putting "dataset$" in either of those positions returns an empty table which then causes an error in the query editor. 😞

Anonymous
Not applicable


I'm still stuck with:

# 'dataset' holds the input data for this script
pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"
pattern2 <- "[0-9][A-Za-z]{2}"
match1 <- function(x) {gregexpr(pattern1 , as.character(x), perl=TRUE)}
part1 <- function(x) {c(regmatches(x, match1(x)))}

PCDS <- within(dataset, {PCDS = part1(post_code)})

After playing around in the R shell:

> post_code<-c("A12 3BC","A123BC","A12    3BC","a12 3bc","1234")

> pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"

> match1 <- function(x) {gregexpr(pattern1 , as.character(x), perl=TRUE)}

> part1 <- function(x) {c(regmatches(x, match1(x)))}

> part1(post_code)
[[1]]
[1] "A12"

[[2]]
[1] "A12"

[[3]]
[1] "A12"

[[4]]
[1] "a12"

[[5]]
character(0)

> post_code
[1] "A12 3BC"    "A123BC"     "A12    3BC" "a12 3bc"    "1234"  

As you can see the data has a very different arrangement between post_code and part1(postcode). So I'm stilthinking down the lines that the vector just needs reorganising.

 

Problem is, I learned R last week. This didn't work!

output1<-vector(mode="character", length = 0)
for(val in part1){
    if(length(val)==0){
        output1<=c(part1,"Not UK Postcode")
    }
    else{
        output1<-val
    }
}

So, if you know anything about R vector manipulationm, please help!

xx

Anonymous
Not applicable

Fixed it! The regex could be more comprehensive and it's very step by step, but this R script converts UK postcodes matching the regex pattern to an 8 character format in the query editor.

# 'dataset' holds the input data for this script
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("\\s+|\\s+", "", x)
ws <- function(x){ 
    if(x>0){strrep(" ",x)}
    else {""}
}

pattern <- "^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) ?[0-9][A-Za-z]{2}"
match <- function(x) {gregexpr(pattern , as.character(x), perl=TRUE)}
part <- function(x) {c(regmatches(x, match(x)))}

pCodeList <- part(dataset$post_code)
pCodeVector <- as.character(pCodeList)

for(i in 1:length(pCodeVector))
{
    pCodeVector[i] <- trim(pCodeVector[i])
    pCodeLength <- nchar(pCodeVector[i])
    pCodeRight <- toupper(substr(pCodeVector[i], pCodeLength - 2, pCodeLength))
    pCodeLeft <- toupper(substr(pCodeVector[i],1, pCodeLength -3))
    pCodeVector[i] <- paste(pCodeLeft, ws((8 - nchar(pCodeLeft)) - nchar(pCodeRight)), pCodeRight, sep='')
}

PCDS <- within(dataset, {PCDS = (pCodeVector)})

Development discussion warmly welcomed.

 

Beth

x

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.