Amiregent
New Member

PDFs as a data source

Would anyone have suggestions as to how one could extract data from PDF pages which can then be used in Power BI?

1 ACCEPTED SOLUTION
mllopis
Community Admin

Amiregent,

 

There's currently no solution to import PDF data into Power BI Desktop. This data source is on our radar, as we have received some requests in the past, but we don't have any immediate plans to add it to the product.

 

If this is an important feature for you, please leave a vote for it in our UserVoice Feature Suggestions forum. Please also leave a description about the type of data that you need to extract from the PDF document. For instance, are you trying to extract one/multiple tables from the document, or plain text?

 

Thanks,
M.


23 REPLIES
chezmo
New Member

Extracting table data from PDF documents can be really tricky, for example if a table spans several pages or if your PDF file is actually a set of scanned images. There are, however, PDF parser solutions on the market that can batch convert PDF to Excel. One I know of is called Docparser.

nwabramson
Advocate III

I would recommend using the Apache Tika program to create a text file for parsing.

 

Sample syntax is simply:

 

python tika.py parse text filename.pdf
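Once the text has been extracted, the parsing step might look something like the minimal Python sketch below. It assumes the extracted text keeps table columns separated by two or more spaces (or tabs), which will not hold for every PDF. The function names `rows_from_text` and `to_csv` and the sample text are made up for illustration.

```python
import csv
import io
import re


def rows_from_text(text):
    """Split extracted PDF text into rows of cells.

    Assumption: columns are separated by runs of two or more
    whitespace characters or by a tab; real PDFs may need another rule.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between blocks of text
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def to_csv(rows):
    """Serialise rows to CSV text, which Power BI can load as a Text/CSV source."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample of what the extracted text might look like.
sample = """
Region    Units   Revenue
North     12      340.50
South     7       99.00
"""

table = rows_from_text(sample)
print(to_csv(table))
```

Text that uses single spaces between columns, or cells that themselves contain double spaces, would need a different splitting rule.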

 

 

Or: open the PDF in Word -> save it as a web page -> import it as HTML, as described here: http://exceleratorbi.com.au/import-tabular-data-pdf-using-power-query/

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!


Jim_Philips
Frequent Visitor

A VBA procedure to extract information from the PDF would normally be ideal for a recurring process, such as a PDF file published once a month with new information.  Point the procedure at the new or changed PDF file and push the button again and your Excel tables are refreshed with the new information.

 

The VBA solution I am suggesting here does not involve copy and paste and it does not involve converting your PDF file to Word or Excel first.  The conversion of your PDF to Word or Excel may work for some files that are not large or complex, but the process will likely be slower than the VBA solution that reads the file and extracts the appropriate information to write to your Excel workbook.  I have used the VBA read and extract solution on PDF files as large as 70M and 22,000 pages, writing 3-4 thousand rows to an Excel table.

 

If you need such a solution, contact me.

Hello Jim.  We need a solution to pull in a PDF document from a website on a recurring basis and import to Power BI.

 

Can you please provide further details on how the VBA solution can help?

 

Thanks.

As I mentioned in my reply to your Private Message, if I can get a copy of the PDF file and a good description of the information you would like to extract from it to Excel, I could provide a detailed response. 

Just to close the loop on this request: importing from PDF files is currently a preview feature in Power BI Desktop. Please try it out and let us know what you think!

 

Ehren

Jam54
Frequent Visitor

As the new release has made the PDF connector generally available (GA), I was wondering the following:

 

Is there a function or script that can extract values from PDF tables automatically, similar to scraping data from HTML websites, but for a bulk set of PDF files?

For example: if I select a folder of PDFs, can it look through all of them for tables containing the referenced values or words, and download only those tables automatically?

I haven't found a function that does this.

Does anyone have any advice?

@Jam54 ,

To my knowledge, all tables would be downloaded, and you can then filter them by their content.

Just create a table with one URL in each row and add a column in which you extract all tables first. Then add another column that filters the column with the list of tables (use Table.Contains if you want to search for a word in all columns).
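The filtering logic described above can be sketched as follows, in Python rather than M and purely for illustration. It assumes the tables have already been extracted into lists of rows, one list of tables per source file. The names `table_mentions`, `filter_tables`, and `tables_by_source`, and the sample data, are hypothetical.

```python
def table_mentions(table, keyword):
    """Return True if any cell in the table contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return any(needle in str(cell).lower() for row in table for cell in row)


def filter_tables(tables_by_source, keyword):
    """Keep, for each source, only the tables that mention the keyword.

    Sources with no matching table are left out of the result.
    """
    kept = {}
    for source, tables in tables_by_source.items():
        matches = [t for t in tables if table_mentions(t, keyword)]
        if matches:
            kept[source] = matches
    return kept


# Hypothetical extraction result: one entry per PDF, each holding its tables.
tables_by_source = {
    "report_a.pdf": [
        [["Item", "Revenue"], ["Widgets", 120]],
        [["Name", "Office"], ["Ann", "Oslo"]],
    ],
    "report_b.pdf": [
        [["Name", "Office"], ["Bob", "Lima"]],
    ],
}

revenue_tables = filter_tables(tables_by_source, "revenue")
print(revenue_tables)
```

The same shape carries over to a query: extract every table first, then keep only those whose content matches.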

Imke Feldmann (The BIccountant)


Does anyone know when this functionality will be available for Power Query in Excel?

Hi,

My team is currently working on this. I hope the PDF connector in Excel will be available for Office 365 subscribers early next year.

Guy

- Excel Team

Hello,
Is there an update on the availability of the M-language function Pdf.Tables in Excel?

The last reply said 'early 2020', and it is now the end of April 2020.

Thx.

Best regards,

Dirk

Hi @dverliefden ,

 

The work is in progress. If all goes well and there are no surprises, we intend to open the PDF connector in Excel to Insiders during the May-June timeframe.

 

Hope it helps.

 

Guy

- Excel Team

@guyhunkin I would love to use this. Any idea when this will be GA?

Hi! The PDF data connector is already available in Excel for Office 365 subscribers. Please refer to this blog post for more information:

https://techcommunity.microsoft.com/t5/excel-blog/announcing-data-import-from-pdf-documents/ba-p/156...

 

Guy

- Excel Team


I've used the new preview feature quite a bit on one project. After a pothole in the December update it is now (Feb 2019 update) working quite effectively. Due to the nature of the data source, it's always going to be more art than science and need a lot of supporting work in your queries, but this is a very good option to look at.

Jim_Philips
Frequent Visitor

It is possible to write a VBA procedure to read a PDF file and write selected information to your Excel workbook.  With the procedure written, you could create entire tables in Excel from your PDF at the push of a button.  Once the information is in Excel, it is available to Power BI.

 

The type of VBA procedure I have in mind requires you to have, in addition to Excel, Acrobat regular or pro (not Acrobat Reader).

 

The difficulty of writing the VBA procedure will vary with the PDF file and the information you are wanting to extract from it.

mike_honey
Memorable Member

Word 2013+ can open PDF files and does a reasonable job of interpreting their tables etc. From there I would copy and paste the data into an Excel file to get a consistent set of rows and columns. Whole tables should come across quite easily.

 

If this was a regular requirement, you could probably record/write a Word macro (VBA) to automate the steps.

