Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
seedrs91
Frequent Visitor

Very tricky text extraction

Hi all, I have two columns below where there are multiple “tenants” in one column and multiple “areas” in another. I want to extract all the “SA Majeste” and “PWGSC” entities from the Tenant column such that in the new first column, I get:

 

  • SA MAJESTE LA REINE ( 81001133)
  • SA MAJESTE LA REINE (81001086

 

etc. and their corresponding areas—formatted as “(XX,XXX.00)”—in the new second column:

 

  • 6,075
  • 7,453

 

The SA Majeste or PWGSC entities are not always in second place. They could be in third, fourth or nth place. Thus, there is a pattern for extraction, but I had no luck figuring out the M code for this extraction. Unfortunately, all the data comes from a PDF online, so I have no control over how the data are structured…I’d have normalized the data to make the table analysis-friendly otherwise.

 

seedrs91_0-1689190600254.jpeg

4 REPLIES 4
smpa01
Super User
Super User

@seedrs91 is there any other data format avaialble other than PDF for the download? such as XML/HTML?

Did I answer your question? Mark my post as a solution!
Proud to be a Super User!
My custom visualization projects
Plotting Live Sound: Viz1
Beautiful News:Viz1, Viz2, Viz3
Visual Capitalist: Working Hrs

Unfortunately not, It is only available in PDF.

Like you said there is no apparent pattern to this and trying to come up with a pattern and then translate that into regex and then using js within power query seems like a difficult task for this (from performance standpoint too).

There is one thing that comes to my mind; if use node.js; there is a package called pdf2json. So you can create a node project and utilize pdf2json to convert that to a json first and then pass the json to power bi. I am not 100% sure though if it would still give you what you need.

 

Or if you have direct access to data vendor, you can ask them to format the data with desired delimiter between tenants and corresponding fields before making it avaialble to clients. I did exactly this in similar situation like you at past.

Did I answer your question? Mark my post as a solution!
Proud to be a Super User!
My custom visualization projects
Plotting Live Sound: Viz1
Beautiful News:Viz1, Viz2, Viz3
Visual Capitalist: Working Hrs

Sorry, I'm still a beginner so may I ask what is it that pdf2json does that helps with my goal?

 


@smpa01 wrote:

 

Or if you have direct access to data vendor, you can ask them to format the data with desired delimiter between tenants and corresponding fields before making it avaialble to clients. I did exactly this in similar situation like you at past.


Apologies but this won't be possible. This is a bid whereby the vendor doesn't provide preferntial treatment to any bidder.

Helpful resources

Announcements
PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

Top Solution Authors
Top Kudoed Authors