Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
Ramya_data
New Member

Extracting data from PDF File

Hi all, I am trying to extract investment banking related data which is in pdf into POWERBI . The pdf is a not completely in a tabular structure. Its a mix of text ,graphs and table. I want to extract the data in table form for my analysis. When i load the pdf file, i could see the column is mixed up with text as well as the data in table format which becomes difficult for cleaning. Has anyone experienced this?? Is there any simple way to do this..??

4 REPLIES 4
v-junyant-msft
Community Support
Community Support

Hi @Ramya_data ,

I'm not quite sure what the exact data structure of your PDF looks like, but you could try using Power Query to connect to the PDF.
When you connect to a PDF file using Power Query, it attempts to identify tables within the PDF. However, it might not always perfectly separate text and tables as you've experienced.
After loading your PDF into Power Query, you can manually select the portion of the data you're interested in. This step might require some trial and error to isolate the tables from the surrounding text.
Once you've managed to import the data into Power BI, you might need to perform some data cleaning operations. Power Query offers a variety of transformation and cleaning features, such as removing unwanted columns, filtering rows, splitting columns by delimiter, and replacing values.

However, the above suggestions are more for your convenience in cleaning up your data after it has been loaded. There may not be a better suggestion for data connection, please try to standardize the data format in your PDF, and separate different format types in different pages as far as possible.

Best Regards,
Dino Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

dufoq3
Super User
Super User

There is no simple way probably, but if you share the pdf with us, maybe we can help you.


Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

https://www.blackrock.com/se/individual/literature/fact-sheet/iumo-ishares-edge-msci-usa-momentum-fa...

this is the link for the pdf file i am working on.

 

I want the product information,key facts data on the page 1 to be in table format. i am having 10 of similar pdfs to extract the data from and use for my analysis.

You pasted the link into link reference twice.

 

Try this:

Result

dufoq3_0-1711139214686.png

 

let
    Source = Pdf.Tables(Web.Contents("https://www.blackrock.com/se/individual/literature/fact-sheet/iumo-ishares-edge-msci-usa-momentum-factor-ucits-etf-fund-fact-sheet-en-se.pdf"), [Implementation="1.3"]),
    Page001 = Source{[Id="Page001"]}[Data],
    SelectColumnWithData = Table.FromColumns(List.Select(Table.ToColumns(Page001), each List.Contains(_, "PRODUCT INFORMATION"))),
    Ad_HeaderColumn = Table.AddColumn(SelectColumnWithData, "HeaderColumn", each if List.Contains({"product information", "key facts", "portfolio characteristics"}, [Column1], Comparer.OrdinalIgnoreCase) then [Column1] else null,type text),
    FilledDown = Table.FillDown(Ad_HeaderColumn,{"HeaderColumn"}),
    FilteredRows2 = Table.SelectRows(FilledDown, each not List.Contains(List.Distinct(FilledDown[HeaderColumn]), [Column1]) and [HeaderColumn] <> null)
in
    FilteredRows2

 


Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

Helpful resources

Announcements
LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.

Top Solution Authors
Top Kudoed Authors