Solved: Re: Failure to extract file from ZIP

Anonymous · ‎06-27-2020

I know there has been several attempts to address this previously, but so far without resolution (as far as I am aware).

So I am providing more detailed information into the zip file properties, in the hope that people much smarter than me, will be able to identify the root-cause of the problem.

Background

Custom function to extract files from Zip, successfully created in line with Mark White's BI Blog
When trying to access the binary data of the CSV file, Power Query throws the error "We didn’t recognize the format of your first file (). Please filter the list of files so it contains only supported types (Text, .csv, Excel workbooks etc.) and try again"

New Information

The error mentioned above in point #2, only occurs when I am using ZIP files which havw been generated by SAP BI.
If I extract the CSV file from the zip, and manually create a new zip file containing the same CSV, then Power Query is able to successfully access the binary data using the same custom function created above in point #1.

This indicates that the Power Query error is being triggered by either the method used to compress the CSV file or potentially the code for the custom function not handling all compression formats (??)

ZIP File Details

The zip file without the problem has been named works.zip and the one which causes the error is doesnotwork.zip
Here you can view the properties of the compressed CSV files inside their respective ZIPs (works and doesnotwork), the differences are highlighted in yellow on the "doesnotwork" image
Here are the properties for the zip files themselves, again with the differences highlighted in yellow

I'm hoping this can shed some light on this problem & maybe generate a solution.

Thanks in advance for anyone spending time looking into this, and if it's useful here are the query and custom function in excel

artemus · ‎06-27-2020

Have you tried the implementation I posted here: https://community.powerbi.com/t5/Power-Query/How-to-connect-Azure-DevOps-REST-API-in-to-power-bi/m-p...

View solution in original post

lbendlin · ‎06-27-2020

Here is a version that should work with all of your ZIP files. It ignores the local file entries and grabs the data from the central directory instead. Please test it out.

// expects full path to the ZIP file, only extracts the first data file after getting its size from the central directory
// https://en.wikipedia.org/wiki/Zip_(file_format)#Structure
(ZIPFile) =>
let
    //read the entire ZIP file into memory - we'll use it often so this is worth it
    Source = Binary.Buffer(File.Contents(ZIPFile)),
    // get the full size of the ZIP file
    Size = Binary.Length(Source),
    //Find the start of the central directory at the sixth to last byte
    Directory = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Size-6), 
                    Start=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian)
            ]) ,
    Start = Directory(Source)[Start],
    //find the first entry in the directory and get the compressed file size
    FirstDirectoryEntry = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Start+20), 
                    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                    UnCompressedFileSize=BinaryFormat.Binary(4),
                    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    //figure ou where the raw data starts            
    Offset = 30+FirstDirectoryEntry(Source)[FileNameLen]+FirstDirectoryEntry(Source)[ExtrasLen],
    Compressed = FirstDirectoryEntry(Source)[FileSize]+1,
    //get the raw data of the compressed file
    Raw = BinaryFormat.Record([
                    Header=BinaryFormat.Binary(Offset), 
                    Data=BinaryFormat.Binary(Compressed)
            ]) 
    // unzip it
in 
    Binary.Decompress(Raw(Source)[Data], Compression.Deflate)

Name it unzip and and then call it like this

let
    Source = unzip("C:\downloads\doesnotwork.zip"),
    #"Imported CSV" = Csv.Document(Source,[Delimiter=",",  Encoding=1252, QuoteStyle=QuoteStyle.None])
in
    #"Imported CSV"

View solution in original post

lbendlin · ‎06-27-2020

Your doesnotwork.zip file does not adhere to the "old" ZIP standard. Both the compressed file size and the uncompressed file size of the "Report 1.csv" file are zero in the local header. That behavior is normally reserved for files over 4GB.

That can be handled by the unzip tools (which use the central directory at the end), but your M code (which doesn't use the directory - it uses local file headers ) expects actual data there. You need to modify the M code to handle this different situation.

https://en.wikipedia.org/wiki/Zip_(file_format)#Structure

This gets even more exciting:

https://stackoverflow.com/questions/4802097/how-does-one-find-the-start-of-the-central-directory-in-...

Here is the hex dump of the end of your zip file.

0x02014b50 marks the beginning of the official copy of the central directory, as indicated by the last few bytes 0x00761ba9 (which point to the start of the directory)

then at 0x00761bbd you can see the compressed size of the first file (0x00761b6f) and the uncompressed size (0x04eddec7) which translates to 82697927 bytes - exactly the uncompressed size of report 1.csv

That means you need to provide the 0x00761b6f value to the decompression function.

lbendlin · ‎06-27-2020

For reference here is my current code, with the hardcoded size of that one ZIP file so you can see that it works. Obviously this code needs work to properly handle cases where the file size is reported as zero

//  credits:  http://www.excelandpowerbi.com/?p=155
// http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html
let
    Source = File.Contents("C:\downloads\doesnotwork.zip"),
    //define function
    Decompress = (ZIPFile) => 
    let 
    //describe ZIP file format
    MyBinaryFormat = BinaryFormat.Record([ 
					  MiscHeader=BinaryFormat.Binary(18), 
					  FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
					  UnCompressedFileSize=BinaryFormat.Binary(4),
					  FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
					  ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    // how many bytes to get        
    MyCompressedFileSizer = MyBinaryFormat(ZIPFile)[FileSize]+1 ,
    MyCompressedFileSize = 7740272 ,
    // impacts offset
    MyFileNameLen = MyBinaryFormat(ZIPFile)[FileNameLen],
    MyExtrasLen = MyBinaryFormat(ZIPFile)[ExtrasLen] ,
    // describe where the actual data starts and how long it is
    MyBinaryFormat2 = BinaryFormat.Record([
            Header=BinaryFormat.Binary(30+MyFileNameLen+MyExtrasLen), 
            Data=BinaryFormat.Binary(MyCompressedFileSize)
    ]) ,
    // grab the data and deflate it
    DecompressData = Binary.Decompress(MyBinaryFormat2(ZIPFile)[Data], Compression.Deflate) 
    in
    //return deflated binary data
      DecompressData,
    // call the function with the path to the ZIP file. Only extracts the first file in the ZIP, regardless of name  
    MyData = Decompress(Source)
in MyData

lbendlin · ‎06-27-2020

Here is a version that should work with all of your ZIP files. It ignores the local file entries and grabs the data from the central directory instead. Please test it out.

// expects full path to the ZIP file, only extracts the first data file after getting its size from the central directory
// https://en.wikipedia.org/wiki/Zip_(file_format)#Structure
(ZIPFile) =>
let
    //read the entire ZIP file into memory - we'll use it often so this is worth it
    Source = Binary.Buffer(File.Contents(ZIPFile)),
    // get the full size of the ZIP file
    Size = Binary.Length(Source),
    //Find the start of the central directory at the sixth to last byte
    Directory = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Size-6), 
                    Start=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian)
            ]) ,
    Start = Directory(Source)[Start],
    //find the first entry in the directory and get the compressed file size
    FirstDirectoryEntry = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Start+20), 
                    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                    UnCompressedFileSize=BinaryFormat.Binary(4),
                    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    //figure ou where the raw data starts            
    Offset = 30+FirstDirectoryEntry(Source)[FileNameLen]+FirstDirectoryEntry(Source)[ExtrasLen],
    Compressed = FirstDirectoryEntry(Source)[FileSize]+1,
    //get the raw data of the compressed file
    Raw = BinaryFormat.Record([
                    Header=BinaryFormat.Binary(Offset), 
                    Data=BinaryFormat.Binary(Compressed)
            ]) 
    // unzip it
in 
    Binary.Decompress(Raw(Source)[Data], Compression.Deflate)

Name it unzip and and then call it like this

let
    Source = unzip("C:\downloads\doesnotwork.zip"),
    #"Imported CSV" = Csv.Document(Source,[Delimiter=",",  Encoding=1252, QuoteStyle=QuoteStyle.None])
in
    #"Imported CSV"

Anonymous · ‎06-27-2020

@lbendlin firstly thanks for your efforts & excellent analysis, I really appreacite it.

I have tried your code, but at the moment it is only returning the first column from the zip file:

lbendlin · ‎06-28-2020

Remove and reapply the csv import step.

Anonymous · ‎06-28-2020

Sorry, but I don't understand??

lbendlin · ‎06-28-2020

the .Buffer functions load the data once into memory and avoid reloading it over and over again (as is done in the original code)

Anonymous · ‎06-28-2020

that part I understand.

What I don't understand is what you meant by "Remove and reapply the csv import step" or how to get all the columns from the zip

lbendlin · ‎06-28-2020

I meant you should remove the CSV step from my last instructions. Try this instead. It will result in a CSV icon that you can then double click.

let
    Source = unzip("C:\downloads\doesnotwork.zip")
in
    Source

Anonymous · ‎07-08-2020

@lbendlin sorry, I'm back on this topic again 🙄

How would you modify your code to handle mutilple CSV files inside a ZIP?
Currently I always get the first CSV.

Thank you

lbendlin · ‎07-08-2020

Here is a link to a full comprehensive solution by @artemus

https://community.powerbi.com/t5/Power-Query/How-to-connect-Azure-DevOps-REST-API-in-to-power-bi/m-p...

Rob_B · ‎09-23-2020

@lbendlin thanks for providing a solution to this problem. I'm running into a problem and I can't figure out what's going on. When I try to invoke the custom Unzip function, I receive the following error:

An error occurred in the " query. Expression.Error: We cannot convert a value of type Binary to type Text.
Details:
     Value=[Binary]
     Type=[Type]

Any ideas? I should note that I tried Mark White's solution first (http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html) but I ran into issues with that solution as well.

Thanks in advance for your help!

lbendlin · ‎09-23-2020

You may want to share some more details. I also wrote a blog entry on this very topic if you are interested.

Rob_B · ‎09-23-2020

Thanks @lbendlin, I'd love to see your blog. Can you please share the link?

I tried using Mark White's solution to access this zip file but I receive the following error:

His custom function works for other zip files on the same website (using the Web.Contents function as the source) however it doesn't work on the zip file linked above. My guess is that the file size is too large (1.52 GB compressed, >10 GB uncompressed). The .csv in the .zip folder is so large that Zip64 is used. Perhaps that's why Mark's solution doesn't work.

I tried your function but when I invoke it, Power BI gives me the following error message:

I'm not familiar enough with M code to find out what's going wrong in your function. I should note that I receive the same error for all zip files on the website linked above, including the smaller zip files not using Zip64.

Thanks!

Rob

lbendlin · ‎09-23-2020

Here's the blog

https://community.powerbi.com/t5/Community-Blog/Working-With-Zip-Files-in-Power-Query/ba-p/1190186

but your file sizes may be the actual issue. I only tested with files up to 600 MB.

Rob_B · ‎09-23-2020

Thanks @lbendlin, your blog post is really helpful!

Next step is to figure out how to modify the function to accomodate larger file sizes using Zip64 😀

Anonymous · ‎06-28-2020

Ah ok, apologies.

Confirm all is working as expected.

Thanks again for your time & effort on this 👍

artemus · ‎06-27-2020

Have you tried the implementation I posted here: https://community.powerbi.com/t5/Power-Query/How-to-connect-Azure-DevOps-REST-API-in-to-power-bi/m-p...

lbendlin · ‎06-28-2020

@artemus thank you for providing the complete structural solution. I have added a sprinkle of Binary.Buffer to it to make it a bit faster.

Anonymous · ‎06-27-2020

@artemus I hadn't seen that post, but you are right, it does work! 🙂

Thanks for your help

Failure to extract file from ZIP

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024