cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
Mizpah Frequent Visitor
Frequent Visitor

Best way to cleanse this data?

Hi folks, 

 

Working on a first PowerBi project this weekend (having a blast with it!) but running into some gremlins, I have a few queries, but will seperate into different posts!

 

I am trying to cleanse / normalise around 40k rows of data that relate to sold products. An example mockup of the issue is below:

 

Steak32 Day Dry Aged 
Ribeye SteakHereford32 Day Dry-Aged
CentreCut Fillet Longhorn 32 Days
Steak32 dayaged, Aberdeen Angus 

 

What is the best way to 'clean' this data? It currently coming from an excel sheet, but once the proof of concept works will be realtime data in an Azure SQL server or warehouse (and be a far larger data set).

 

In the example above I would want to be able to say '32 Day Dry-Aged' is the data I want, and '32 Day Dry Aged', '32 Days' and '32 dayaged' all mean the same thing. I would want to look for any of those values, and if they exist, create a new column, call it 'Maturation' and populate it accordingly. 

 

I would then need to look in both columns for breed values (in this example Hereford & Longhorn), create column 'Breed' and populate accordingly.

In the real data, I have around 20 attributes (such as 32 Day Dry-Aged, 16 Day Dry-Aged etc) spread between attribute columns. There is no consistancy as to the column that an attribute will be found in, and some products have multiple attributes in the cell.

 

My current though process is around a method using DAX? But searching on 'data normalisation' or 'data cleansing' is not getting me very far.

 

Thanks in advance for any pointers! 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
edhans New Contributor
New Contributor

Re: Best way to cleanse this data?

You should definitely user Power Query and not DAX for this.

 

Since you have such random data it seems, what I would do is create a cross reference table. So find all possible values, using Grouping in Power Query and Load that in an Excel sheet. Then create a second column in the Excel sheet and type in what you want. For example:

Orig Data - New Data

32AgedData - 32 Days

32DayAged - 32 Days

 

etc. It will be tedious to get it all, but it will be far FAR easier than doing dozens/hundreds of IF/THEN type statements.

 

Now load this new table and merge with your source data, then use the "New Data" column as your values.

 

You'll need to periodically see if anything new has come up. Even typos, like 32AegdData. Add those to the table as it happens.

 

You should ultimately load this data into a new table in Azure SQL with the rest of the data, but to quickly get up and running, leaving it in Excel during development works fine.

4 REPLIES 4
edhans New Contributor
New Contributor

Re: Best way to cleanse this data?

You should definitely user Power Query and not DAX for this.

 

Since you have such random data it seems, what I would do is create a cross reference table. So find all possible values, using Grouping in Power Query and Load that in an Excel sheet. Then create a second column in the Excel sheet and type in what you want. For example:

Orig Data - New Data

32AgedData - 32 Days

32DayAged - 32 Days

 

etc. It will be tedious to get it all, but it will be far FAR easier than doing dozens/hundreds of IF/THEN type statements.

 

Now load this new table and merge with your source data, then use the "New Data" column as your values.

 

You'll need to periodically see if anything new has come up. Even typos, like 32AegdData. Add those to the table as it happens.

 

You should ultimately load this data into a new table in Azure SQL with the rest of the data, but to quickly get up and running, leaving it in Excel during development works fine.

Mizpah Frequent Visitor
Frequent Visitor

Re: Best way to cleanse this data?

Thanks Edhans - off to look at power query then!

Mizpah Frequent Visitor
Frequent Visitor

Re: Best way to cleanse this data?

Ok, Some progress, now I just need to work out how to follow instructions better! I have currently completed a bunch of data transofrmation in excel, and created lookup tables with distinct values - I have also removed any 'multivalue' splitting the values between the attribute columns.

 

This is now imported into BI and I am trying to merge this data. With the options I can see, I suspect I have misinterpreted the 'merge' instruction.

 

I am at this step:

 

Capture.PNG

 

Is there a turorial anywhere that would cover looking up the values in attributeOne, and Attribute Two (from ButcherAppImport), and then if a match is found between the attribute columns and the 'old attributes' (in attribute translations) populate the matchiing value from either 'maturation', 'range' or 'reared' in a new column in the original 'butcherAppImport Table (called Maturation, Range or Reared accordingly).

 

Cheers!

 

 

Highlighted
edhans New Contributor
New Contributor

Re: Best way to cleanse this data?

A couple of things I see that might help:

  1. I think you want to merge Attribute2 in the first table to old attributes in the 2nd. For grins, I almost ALWAYS trim the data in both of those columns first to ensure there aren't invisible leading/traling spaces that will mess up the join. Just select the column prior to the merge step, right-click, transform, Trim. Let it insert the step.
  2. I would also merge the Maturation/Range/Reared into a single column prior to the merge so you don't have to select one post-merge. Otherwise, you'll need to expand all 3, then before loading it into the PowerBI model, you'll need to add another column that does a bunch of if this is null then that else if this is null then that else that.

Make sense?