PaulDBrown
Community Champion

Extracting multiple PDF files from a folder

Good evening!

I need to extract multiple pdf files (I'm trying to use the Folder connector) where each file has a different number of pages. My limited knowledge of M code is all the more evident during this process: I'm stuck on the "sample file" which requests I select a "page" to transform. Basically I can't work out how to import all the possible pages from each file to access all available "relevant" data. 
I've attached a zip file with three PDFs (each with a different number of pages) as a sample. I need to get access to all the pages from each file to be able to work on the transformations.
Any guidance will be a huge help!

Many thanks for your time.

Best,

Paul.





Did I answer your question? Mark my post as a solution!
In doing so, you are also helping me. Thank you!

Proud to be a Super User!
Paul on Linkedin.






5 REPLIES
nidheshtiwari
Frequent Visitor

Hi Paul,

 

Did you resolve the problem above? There are 28 tables in your first file, so I just want to confirm whether all the PDF files have the same number of tables and the same structure. Also, do you want to extract all 28 tables or only a specific table?

Thanks

tmhwk
New Member

Hi all

Just joining the conversation now. Has a solution been found for multiple PDFs with multiple pages? I have exactly the same situation and am looking for a solution.

watkinnc
Super User

You could always delete the sample query, and also delete the invocation of the sample query in your main file. That will give you just a list of the tables, which you can then select individually with another query and transform separately.

 

--Nate


I’m usually answering from my phone, which means the results are visualized only in my mind. You’ll need to use my answer to know that it works—but it will work!!
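
Nate's suggestion above can be sketched roughly as follows. This is a minimal, untested sketch, not a confirmed solution: it skips the sample/transform function entirely and calls Pdf.Tables on each file, so every page and table in every PDF appears as its own row. The folder path and the step names (Visible, WithTables, Kept, Expanded) are placeholders I made up for illustration; the Id, Kind and Data columns are the ones Pdf.Tables normally returns.

```powerquery
let
    // Hypothetical folder path: replace with your own
    Source = Folder.Files("C:\YourFolder"),
    // Ignore hidden files, as the generated query does
    Visible = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    // One nested table per file, listing every page/table found in that PDF
    WithTables = Table.AddColumn(
        Visible,
        "PdfTables",
        each Pdf.Tables([Content], [Implementation = "1.3"])
    ),
    Kept = Table.SelectColumns(WithTables, {"Name", "PdfTables"}),
    // One row per file and per page/table; the [Data] column holds the content
    Expanded = Table.ExpandTableColumn(Kept, "PdfTables", {"Id", "Kind", "Data"})
in
    Expanded
```

From the result you could filter on Kind (for example, keep only "Page" rows) and then reference individual Data cells in separate queries, which is the "select individually" part of the suggestion.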

@watkinnc Thanks for the suggestion, but I'm not too sure what you mean. The source data is 34 PDF files with at least half a dozen pages each (where there are rows/text which I don't need mixed with data which I need to transform). 

This is the interface I get when I select the folder connector. I need to select a "page" to access the sample file:

[Screenshot: Load form folder.jpg]

which leads to the following in the Transform file code:

= (Parameter1 as binary) => let
    Source = Pdf.Tables(Parameter1, [Implementation="1.3"]),
    Page001 = Source{[Id="Page001"]}[Data],
    #"Changed Type1" = Table.TransformColumnTypes(Page001,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type text}, {"Column6", type text}, {"Column7", type text}, {"Column8", type text}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Changed Type1",{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type text}, {"Column6", type text}, {"Column7", type text}, {"Column8", type text}})
in
    #"Changed Type"

where the code is hard-coded to "Page001". The final query that loads the data begins with this code:

Source = Folder.Files("D:\OneDrive - In2-Action.com\Biniarbolla\Informes MB\Estructura corta"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    #"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File", each #"Transform File"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transform File"}),
    #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1",{{"Source.Name", type text}, {"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type text}, {"Column6", type text}, {"Column7", type text}, {"Column8", type text}}),

This only loads the first page for each file. How can I change it to load every page from each file?

 

Many thanks!

 











Hi, Paul! Try this .pbix file as an example.
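
For readers who cannot open the attachment, here is a minimal sketch of one possible approach. It is not necessarily what the .pbix does, and it is untested against the sample PDFs. The idea is to change the Transform File function so that, instead of picking Page001, it keeps every row of Pdf.Tables whose Kind is "Page" and appends them. The step names Pages, Tagged and Combined, and the added "Page.Id" column, are my own assumptions for illustration.

```powerquery
(Parameter1 as binary) as table =>
let
    Source = Pdf.Tables(Parameter1, [Implementation = "1.3"]),
    // Keep the page-level entries only (Pdf.Tables also lists detected tables)
    Pages = Table.SelectRows(Source, each [Kind] = "Page"),
    // Tag each page's rows with the page Id so they stay identifiable after appending
    Tagged = Table.AddColumn(
        Pages,
        "Tagged",
        (pageRow) => Table.AddColumn(pageRow[Data], "Page.Id", each pageRow[Id], type text)
    ),
    // Append all pages of this file; columns missing on a page come through as null
    Combined = Table.Combine(Tagged[Tagged])
in
    Combined
```

Two things to watch if you try this. The generated "Changed Type" steps hard-code Column1 to Column8, so they are left out here and typing is better done after the pages are combined. Also, the "Expanded Table Column1" step in the main query takes its column names from the sample file only, so a file whose pages have more columns than the sample's may lose the extra ones unless that step is adjusted.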
