I'm pulling a large set of PDF files from a blob container in Azure: 4,735 files totalling 476 MB. The main issue is that when refreshing, or even just adding a step in Power Query, over 900 MB of data is downloaded. It takes a very long time to refresh and to apply any new steps in Power Query. I've attached the relevant Power Query code below. This report is only intended to identify PDFs that are corrupted or in the wrong format so they can be fixed for use in another report, so the query is relatively simple.
// simpleLoad
let
    Source = (#"Sample File Parameter1" as binary) =>
        let
            Source = Pdf.Tables(#"Sample File Parameter1"),
            Table002 = Source{[Id="Table002"]}[Data]
        in
            Table002
in
    Source
// timesheets (2)
let
    Source = AzureStorage.Blobs(acct),
    timesheets1 = Source{[Name="timesheets"]}[Data],
    #"buffer" = Table.Buffer(timesheets1),
    #"Invoke Custom Function1" = Table.AddColumn(#"buffer", "simpleLoad", each #"simpleLoad"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "simpleLoad"}),
    #"Kept Errors" = Table.SelectRowsWithErrors(#"Removed Other Columns1", {"simpleLoad"}),
    #"Added Custom" = Table.AddColumn(#"Kept Errors", "Error", each try [simpleLoad]),
    #"Expanded Error" = Table.ExpandRecordColumn(#"Added Custom", "Error", {"Error"}, {"Error.1"}),
    #"Expanded Error.1" = Table.ExpandRecordColumn(#"Expanded Error", "Error.1", {"Reason", "Message"}, {"Reason", "Message"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Error.1", {"simpleLoad"})
in
    #"Removed Columns"
I found another post dealing with the same issue but with gzip files; the recommendation there was to disable the auto-generated date hierarchy tables, but that did not change anything in my report. I added the buffer step hoping it would make adding steps in Power Query quicker, but that didn't seem to help either.
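As a side note, the `Table.Buffer` above buffers the list of blobs, not the bytes of each file. A per-file variant of the same buffering idea would wrap each file's binary in `Binary.Buffer` before `Pdf.Tables` parses it, so the blob is not re-requested if Power Query evaluates the binary more than once. This is only a sketch of that variant, not a tested fix for the doubled download:

```
// simpleLoad, buffering each file's bytes before parsing (sketch)
let
    Source = (#"Sample File Parameter1" as binary) =>
        let
            // Binary.Buffer pins the whole binary in memory so repeated
            // evaluations read the buffered copy instead of the blob store
            Buffered = Binary.Buffer(#"Sample File Parameter1"),
            Tables = Pdf.Tables(Buffered),
            Table002 = Tables{[Id="Table002"]}[Data]
        in
            Table002
in
    Source
```

The trade-off is memory: each file is held fully in RAM while it is parsed, which is usually acceptable for PDFs of this size.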
Hello @anjbauer1993 ,
Were you able to figure out why your data load was doubled in Power BI? I'm facing a similar issue loading from Azure Blob Storage: I'm reading a CSV file of around 4 GB, but Power BI loads more than 8 GB for it.
Can anyone help on it?
From the Power Query you have posted, it seems you expand the Error column over several steps. I'm not sure, but this could be the cause of the issue, so for troubleshooting you may remove these steps and check whether the data size returns to normal.
let
    Source = AzureStorage.Blobs(acct),
    timesheets1 = Source{[Name="timesheets"]}[Data],
    #"buffer" = Table.Buffer(timesheets1),
    #"Invoke Custom Function1" = Table.AddColumn(#"buffer", "simpleLoad", each #"simpleLoad"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "simpleLoad"}),
    #"Kept Errors" = Table.SelectRowsWithErrors(#"Removed Other Columns1", {"simpleLoad"}),
    #"Added Custom" = Table.AddColumn(#"Kept Errors", "Error", each try [simpleLoad]),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"simpleLoad"})
in
    #"Removed Columns"
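If only the error text is needed, the two record-expansion steps could also be collapsed into a single added column that reads `Reason` and `Message` straight out of the `try` record. This is a sketch of that pattern, not tested against this dataset:

```
// One-step error capture (sketch): no separate expand steps needed
#"Added Custom" = Table.AddColumn(#"Kept Errors", "ErrorDetail",
    each
        let
            r = try [simpleLoad]
        in
            // r[HasError] is true when simpleLoad failed for this file;
            // r[Error] then holds a record with Reason and Message fields
            if r[HasError]
            then [Reason = r[Error][Reason], Message = r[Error][Message]]
            else null)
```

Whether this changes the amount of data downloaded is uncertain, but it keeps the query shorter and avoids materialising the intermediate expanded columns.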
Community Support Team _ Jimmy Tao
If this post helps, then please consider Accepting it as the solution to help the other members find it more quickly.
Unfortunately that did not do the trick. I've created a sample dataset to move through things more quickly, now only 69 MB. I also created a new query that only pulls the files in and performs no transformations, but it still downloads 132 MB when refreshing. The Power Query code is below.
// timesheets-test
let
    Source = AzureStorage.Blobs(acct),
    #"timesheets-test1" = Source{[Name="timesheets-test"]}[Data]
in
    #"timesheets-test1"
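While diagnosing, it may also help to cap how many blobs the test query touches at all, so each refresh only downloads a handful of files instead of the whole container. A sketch of that restriction (the `10` is an arbitrary sample size):

```
// timesheets-test, limited to a few blobs while diagnosing (sketch)
let
    Source = AzureStorage.Blobs(acct),
    #"timesheets-test1" = Source{[Name="timesheets-test"]}[Data],
    // Table.FirstN keeps only the first rows of the blob listing,
    // so only those files' Content binaries can be downloaded
    SmallSample = Table.FirstN(#"timesheets-test1", 10)
in
    SmallSample
```

If even this small sample still downloads roughly double its size on refresh, that would point at the query being evaluated more than once per refresh rather than at any particular transformation step.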