Extracting data from PDF File

Ramya_data · ‎03-21-2024

Hi all, I am trying to extract investment banking related data which is in pdf into POWERBI . The pdf is a not completely in a tabular structure. Its a mix of text ,graphs and table. I want to extract the data in table form for my analysis. When i load the pdf file, i could see the column is mixed up with text as well as the data in table format which becomes difficult for cleaning. Has anyone experienced this?? Is there any simple way to do this..??

v-junyant-msft · ‎03-21-2024

Hi @Ramya_data ,

I'm not quite sure what the exact data structure of your PDF looks like, but you could try using Power Query to connect to the PDF.
When you connect to a PDF file using Power Query, it attempts to identify tables within the PDF. However, it might not always perfectly separate text and tables as you've experienced.
After loading your PDF into Power Query, you can manually select the portion of the data you're interested in. This step might require some trial and error to isolate the tables from the surrounding text.
Once you've managed to import the data into Power BI, you might need to perform some data cleaning operations. Power Query offers a variety of transformation and cleaning features, such as removing unwanted columns, filtering rows, splitting columns by delimiter, and replacing values.

However, the above suggestions are more for your convenience in cleaning up your data after it has been loaded. There may not be a better suggestion for data connection, please try to standardize the data format in your PDF, and separate different format types in different pages as far as possible.

Best Regards,
Dino Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

dufoq3 · ‎03-21-2024

There is no simple way probably, but if you share the pdf with us, maybe we can help you.

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

Ramya_data · ‎03-22-2024

https://www.blackrock.com/se/individual/literature/fact-sheet/iumo-ishares-edge-msci-usa-momentum-fa...

this is the link for the pdf file i am working on.

I want the product information,key facts data on the page 1 to be in table format. i am having 10 of similar pdfs to extract the data from and use for my analysis.

dufoq3 · ‎03-22-2024

You pasted the link into link reference twice.

Try this:

Result

let
    Source = Pdf.Tables(Web.Contents("https://www.blackrock.com/se/individual/literature/fact-sheet/iumo-ishares-edge-msci-usa-momentum-factor-ucits-etf-fund-fact-sheet-en-se.pdf"), [Implementation="1.3"]),
    Page001 = Source{[Id="Page001"]}[Data],
    SelectColumnWithData = Table.FromColumns(List.Select(Table.ToColumns(Page001), each List.Contains(_, "PRODUCT INFORMATION"))),
    Ad_HeaderColumn = Table.AddColumn(SelectColumnWithData, "HeaderColumn", each if List.Contains({"product information", "key facts", "portfolio characteristics"}, [Column1], Comparer.OrdinalIgnoreCase) then [Column1] else null,type text),
    FilledDown = Table.FillDown(Ad_HeaderColumn,{"HeaderColumn"}),
    FilteredRows2 = Table.SelectRows(FilledDown, each not List.Contains(List.Distinct(FilledDown[HeaderColumn]), [Column1]) and [HeaderColumn] <> null)
in
    FilteredRows2

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

Extracting data from PDF File

Helpful resources

Fabric certifications survey