Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

ruthpozuelo

Do all your data "cleaning" with Power Query: This is how I used it for WordPress

I recently moved a site from wordpress.org to wordpress.com.

 

I was hosting the site on my server, but as it reached (sometimes) 800 visitors a day, it took too much of the server capacity, so I decided to move it. As I am not monetizing the site, I moved it to wordpress.com.

 

During the move, something went really wrong and all my images’ URLs got broken. The site was live but no images to show and as this is a how-to site, without the pictures, well..., half the value was gone. In wordpress.com it is not possible to access the database, so I was stuck.

 

The problem:

 

Here is how my URLs looked after the move:

 

<a href="http://synoguide.com/2016/04/21/backup-your-files-to-an-external-drive-or-a-shared-file-dsm-6-0/"><i... class="alignnone size-full wp-image-226" src="http://trantor.synology.me/ruthpozuelo/synology-nas-menu-button.jpg" alt="synology NAS menu button" width="56" height="27" />

 

and this is what I needed done:

 

Extract the correct URL (in blue) and replace it with the second one (in green) and remove the entire <ahref= > paragraph.... on 63 blog posts and replace hundreds of wrong URLS…..not a task to do in one day or two….

 

The solution:

 

Every time I have had the need to do repetitive tasks with text or numbers I have used excel, so I through I give it a try. Unfortunately, after a few minutes at it and some google searches, I realized this was not a task for Excel, and then I remembered: I can use Power Query!!

 

This is what I managed to do in Power Query:

 

I copied the entire blog post html:

 

copy wordpress text.png

and got a new text back with the right URLs in excel:

 

copy excel text.png

 

And this is how I did it:

  

I copied the URL in excel and loaded it into Power Query to give it a go at cleaning the mess.

 

1. import data into power query.png

Once imported, I duplicated the column:

 

2. duplicate column power query.png

Now, I needed to extract the blue URL:

<a href="http://synoguide.com/2016/04/21/backup-your-files-to-an-external-drive-or-a-shared-file-dsm-6-0/"><i... class="alignnone size-full wp-image-226" src="http://trantor.synology.me/ruthpozuelo/synology-nas-menu-button.jpg" alt="synology NAS menu button" width="56" height="27" />

 

To accomplish that, I split the column by delimiter ” to remove <a href=" from the URL:

 

3. split column by delimiter.png

 

And this is the result:

 

4. after spllit1.png

Now I needed to remove everything to the right of the URL, so I did the same split again:

 

3. split column by delimiter.png

And this is the result:

 

5.after split 2.png

So, I accomplished the first part: I had the correct URL in one column. I renamed that: New URL:

 

 6.rename column power query.png

 

 

Now I needed to isolate the green URL to replace it with the blue:

 

<a href="http://synoguide.com/2016/04/21/backup-your-files-to-an-external-drive-or-a-shared-file-dsm-6-0/"><i... class="alignnone size-full wp-image-226" src="http://trantor.synology.me/ruthpozuelo/synology-nas-menu-button.jpg" alt="synology NAS menu button" width="56" height="27" />

 

To do that I started by removing the > from the last split:

 

7. remove character.png

There are a lot of ways to do this, but perhaps the easiest one is to do a split again:

 

8. split by delimiter.png

When I did that split, suddenly all the ” disappeared from the URL:

 

9. colons dissapeared.png

No worries, I can get them back later. But anybody knows why they were removed?

 

Ok, back to business.

 

Now it was time to isolate the green URL:

<img class="alignnone size-full wp-image-226" src="http://trantor.synology.me/ruthpozuelo/synology-nas-menu-button.jpg" alt="synology NAS menu button" width="56" height="27" />

 

I could not use the split by "= "or by ”, as i would split the url in the wrong place, so I chose to split by ":"

 

 

10.split by colon.png

 

And here is the result:

 

11.after split.png

Now I needed to split the URL form the right side, so I did an split with space:

 

12.split by space.png

And here is the result:

 

13. after split.png

Now that I isolated the URL i needed to change, I deleted it and I renamed the columns to understand the order later: URL1 and URL3:

 

14. rename urls.png

So my final URL would be URL1 & New URL & URL 3, but there was one last problem. When I split the green URL, I had to do it by with colon, which means that URL1 had http and New URL had http too, so I had to remove it from one of the columns:

 

15.remove http.png

 

For that I used a simple replace:

 

16.replace http.png

Before I concatenated the URLs I had to get back the ” that disappeared during one of the splits. I did that using replace again:

 

17. get the commas back.png

After I have done that on all the places, I could finally build my new URL:

 

18.concatenate power query.png

And this is how the new URL looked like:

 

19 after the concatenate.png

 

Now I had both my old URL and my new URL. Done!!!

 

20. new url old url.png

……

 

Well, not quite….

 

Finding the URLs in the text was not an easy task and proved quite time consuming. After changing the URLs manually in one blog post I was back in Power Query with my thinking hat on.

 

I had an idea: How about copy the entire text and have Power Query give me a new text with the new URLs on the right places? Is that even possible? I had to give it a try.

 

A big enabler of the idea was the fact that all the image URL’s stood alone on a new paragraph so I could filter them out.

This is what I did: I copied the entire blog post html to excel and imported it in Power Query.

 

 

21. copy entire text.png

Once in Power query I filtered by the rows that started with ”<a href”, that filtered all my URLs in the correct order:

 

22. filter rows.png

And then I copied, in the Advance editor, the code from the last transformation to avoid redoing the steps one by one.

 

23. advance editor.png

Once that was done, I was left with all new and old URLs.

 

24. old and new in order.png

Now, how do I replace them back in the text? I thought about merging...

 

So I loaded the text again in Power Query and I added an index, so I would not lose the order of the rows:

 

25. add index.png

And I merged the two tables together:

 

26. merge columns.png

And expanded on the new URL only:

 

27. expand on new url.png

I sorted the rows by the index and I could see how everything aligned perfectly!! Hurrah!

 

28. sort rows.png

The only thing left was to substitute the old ones for the new ones.

 

For that I created a custom column that basically says:

 

"if the new expanded column NewColumn.Custom is empty, then copy the contents of column1 otherwise, copy the contents of NewColumn.Custom"

 

(My apologies for the awful column names, I always rename things properly, but as this was a use-and-throw tool, I did not took the time for it.)

 

 

Here is the formula:

 

29. custom column.png

I removed the columns I didn’t need, loaded everything in Excel and after sorting the columns by index:

 

30.sorting in excel.png

I had my text ready to copy in WordPress.

 

To change the URL’s on all my posts all I needed to do was to copy to excel, press update and paste to WordPress!!!

 

31. final text.png

 

 

Lifesaver!! 🙂

  

Want to see the file? I attach it here for your review. I warn you in advance of the awful column names, but as this was a use-and-throw tool, I did not took the time to name them properly.

Comments