I have a table with multiple columns and in one column (it is not formatted as table) I have more than 62000 rows. this column called Resource_column is full of errors, typos, nomenclatures, etc. I also another column called Reference_Column which is formated as table. I have to compare my Resource_column with Reference_column and replace all the errors with good ones based on the Resource_column. I have put the link to the file here:
and added an screenshots to show the difference:
Notice: the blank rows in the Resource_Column should not be filled at the end of replacing.
Thanks in advance!
Edit: I really appreciate your support as it looks pretty complicated issue. The main problem is that I am looking for a way to not go into Bulk Find and Replace alley as I have a lot of values to be fixed (it is not possible to make a complete find column and use it with replace)
According to your description, I think your requirement should more suitable to use AI training of text analysis instead of power query/Dax.
For this scenario, I'd like to suggest your or do data processing on the source data out of power bi side or try to use a web connector to invoke the external API of AI analysis.
Thank you for the reply!
Would you suggest the best API or AI app/software to take advantage as this is not a recurring job but an ad hoc report!
If you are not familiar with these parts, I'd like to suggest you use the third-party API(e.g MS partner that provides the text analysis service) for these types of AI text analysis services.
Then you only need to use raw records to invoke the API to get the fixed results without complex training and learning steps.
@bestmbaman I would try the fuzzy matching in Merge queries first. If that fails then you can look at my very flexible and tunable DAX Fuzzy matching algorithm here: Fuzzy - Microsoft Power BI Community.
Either way though, I agree with @lbendlin that you won't be able to identify true errors unless there is some other data you have that can identify that.
@Greg_Deckler Thank you Greg!
by other Data, what exactly you mean?! I already shared a reference table (with only one column) that contains all the correct names. The challange is how I can figure multiple type of errors and at the same time how to match two different columns with different number of rows?
Assuming that I figured the errors and fixed them , how can I replace the good values in a column with 406 rows to a column with +62000 rows?!
After all, what programming language other than DAX can be used here if DAX falls short here?
Please don't Archive this question and let other ones to try their opinion, I believe this challange comming from the real world business is not just mine.
The real challange is here, I would kindly appreciate your opinion
@bestmbaman By the data I mean I don't want to have to type all of your data from a screen shot but rather just copy and paste it. In theory you have access to Power Query and could use that to solve this problem, particularly the fuzzy matching options when performing a Merge query. Outside of that you could also possibly do this in R or Python within Power Query as well. But, once you get to the Desktop itselft then yes it is DAX for the most part. And thus you could potentially use my very flexible and tunable DAX Fuzzy matching algorithm here: Fuzzy - Microsoft Power BI Community.
I think you have bigger problems than that. How is the script supposed to be able to tell if it is a typo or an error? Which similarity algorithm are you planning to employ? Soundex? Levenshtein ? some other one?
Thanks you @lbendlin ,
I am totally confused with this challange!
In this case what is the best solution you can think of?
Also please note that this is just a sample of the real data set manipulated and transformed only for this community to show the real problem and seek solution!
Looking forward to hearing from you
Also calling on all great heros:
There is no magic to this - you and your requestor need to agree on which algorithm to use, and what the cutoffs should be. The shorter the words the more risky/useless the algorithms will become.