Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Anonymous
Not applicable

Failure to extract file from ZIP

I know there has been several attempts to address this previously, but so far without resolution (as far as I am aware).

So I am providing more detailed information into the zip file properties, in the hope that people much smarter than me, will be able to identify the root-cause of the problem. 

 

Background

  1. Custom function to extract files from Zip, successfully created in line with Mark White's BI Blog
  2. When trying to access the binary data of the CSV file, Power Query throws the error "We didn’t recognize the format of your first file (). Please filter the list of files so it contains only supported types (Text, .csv, Excel workbooks etc.) and try again"

 

New Information

  • The error mentioned above in point #2, only occurs when I am using ZIP files which havw been generated by SAP BI.
  • If I extract the CSV file from the zip, and manually create a new zip file containing the same CSV, then Power Query is able to successfully access the binary data using the same custom function created above in point #1.

This indicates that the Power Query error is being triggered by either the method used to compress the CSV file or potentially the code for the custom function not handling all compression formats (??)

 

ZIP File Details

  • The zip file without the problem has been named works.zip and the one which causes the error is doesnotwork.zip
  • Here you can view the properties of the compressed CSV files inside their respective ZIPs (works and doesnotwork), the differences are highlighted in yellow on the "doesnotwork" image
  • Here are the properties for the zip files themselves, again with the differences highlighted in yellow

I'm hoping this can shed some light on this problem & maybe generate a solution.

Thanks in advance for anyone spending time looking into this, and if it's useful here are the query and custom function in excel

2 ACCEPTED SOLUTIONS
artemus
Employee
Employee

Here is a version that should work with all of your ZIP files. It ignores the local file entries and grabs the data from the central directory instead.  Please test it out.

 

// expects full path to the ZIP file, only extracts the first data file after getting its size from the central directory
// https://en.wikipedia.org/wiki/Zip_(file_format)#Structure
(ZIPFile) =>
let
    //read the entire ZIP file into memory - we'll use it often so this is worth it
    Source = Binary.Buffer(File.Contents(ZIPFile)),
    // get the full size of the ZIP file
    Size = Binary.Length(Source),
    //Find the start of the central directory at the sixth to last byte
    Directory = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Size-6), 
                    Start=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian)
            ]) ,
    Start = Directory(Source)[Start],
    //find the first entry in the directory and get the compressed file size
    FirstDirectoryEntry = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Start+20), 
                    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                    UnCompressedFileSize=BinaryFormat.Binary(4),
                    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    //figure ou where the raw data starts            
    Offset = 30+FirstDirectoryEntry(Source)[FileNameLen]+FirstDirectoryEntry(Source)[ExtrasLen],
    Compressed = FirstDirectoryEntry(Source)[FileSize]+1,
    //get the raw data of the compressed file
    Raw = BinaryFormat.Record([
                    Header=BinaryFormat.Binary(Offset), 
                    Data=BinaryFormat.Binary(Compressed)
            ]) 
    // unzip it
in 
    Binary.Decompress(Raw(Source)[Data], Compression.Deflate)

 

Name it  unzip and and then call it like this

 

let
    Source = unzip("C:\downloads\doesnotwork.zip"),
    #"Imported CSV" = Csv.Document(Source,[Delimiter=",",  Encoding=1252, QuoteStyle=QuoteStyle.None])
in
    #"Imported CSV"

View solution in original post

20 REPLIES 20
lbendlin
Super User
Super User

Your doesnotwork.zip file does not adhere to the "old" ZIP standard. Both the compressed file size and the uncompressed file size of the "Report 1.csv"  file are zero in the local header. That behavior is normally reserved for files over 4GB.

 

That can be handled by the unzip tools (which use the central directory at the end), but your M code (which doesn't use the directory - it uses local file headers ) expects actual data there.  You need to modify the M code to handle this different situation.

 

https://en.wikipedia.org/wiki/Zip_(file_format)#Structure

 

This gets even more exciting:

https://stackoverflow.com/questions/4802097/how-does-one-find-the-start-of-the-central-directory-in-...

 

Here is the hex dump of the end of your zip file.

0x02014b50 marks the beginning of the official copy of the central directory, as indicated by the last few bytes 0x00761ba9 (which point to the start of the directory)

then at 0x00761bbd you can see the compressed size of the first file (0x00761b6f) and the uncompressed size (0x04eddec7) which translates to 82697927 bytes - exactly the uncompressed size of report 1.csv

 

That means you need to provide the 0x00761b6f value to the decompression function.

 

Annotation 2020-06-27 152623.png  

 

For reference here is my current code, with the hardcoded size of that one ZIP file so you can see that it works.  Obviously this code needs work to properly handle cases where the file size is reported as zero

 

//  credits:  http://www.excelandpowerbi.com/?p=155
// http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html
let
    Source = File.Contents("C:\downloads\doesnotwork.zip"),
    //define function
    Decompress = (ZIPFile) => 
    let 
    //describe ZIP file format
    MyBinaryFormat = BinaryFormat.Record([ 
					  MiscHeader=BinaryFormat.Binary(18), 
					  FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
					  UnCompressedFileSize=BinaryFormat.Binary(4),
					  FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
					  ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    // how many bytes to get        
    MyCompressedFileSizer = MyBinaryFormat(ZIPFile)[FileSize]+1 ,
    MyCompressedFileSize = 7740272 ,
    // impacts offset
    MyFileNameLen = MyBinaryFormat(ZIPFile)[FileNameLen],
    MyExtrasLen = MyBinaryFormat(ZIPFile)[ExtrasLen] ,
    // describe where the actual data starts and how long it is
    MyBinaryFormat2 = BinaryFormat.Record([
            Header=BinaryFormat.Binary(30+MyFileNameLen+MyExtrasLen), 
            Data=BinaryFormat.Binary(MyCompressedFileSize)
    ]) ,
    // grab the data and deflate it
    DecompressData = Binary.Decompress(MyBinaryFormat2(ZIPFile)[Data], Compression.Deflate) 
    in
    //return deflated binary data
      DecompressData,
    // call the function with the path to the ZIP file. Only extracts the first file in the ZIP, regardless of name  
    MyData = Decompress(Source)
in MyData

Here is a version that should work with all of your ZIP files. It ignores the local file entries and grabs the data from the central directory instead.  Please test it out.

 

// expects full path to the ZIP file, only extracts the first data file after getting its size from the central directory
// https://en.wikipedia.org/wiki/Zip_(file_format)#Structure
(ZIPFile) =>
let
    //read the entire ZIP file into memory - we'll use it often so this is worth it
    Source = Binary.Buffer(File.Contents(ZIPFile)),
    // get the full size of the ZIP file
    Size = Binary.Length(Source),
    //Find the start of the central directory at the sixth to last byte
    Directory = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Size-6), 
                    Start=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian)
            ]) ,
    Start = Directory(Source)[Start],
    //find the first entry in the directory and get the compressed file size
    FirstDirectoryEntry = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Start+20), 
                    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                    UnCompressedFileSize=BinaryFormat.Binary(4),
                    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    //figure ou where the raw data starts            
    Offset = 30+FirstDirectoryEntry(Source)[FileNameLen]+FirstDirectoryEntry(Source)[ExtrasLen],
    Compressed = FirstDirectoryEntry(Source)[FileSize]+1,
    //get the raw data of the compressed file
    Raw = BinaryFormat.Record([
                    Header=BinaryFormat.Binary(Offset), 
                    Data=BinaryFormat.Binary(Compressed)
            ]) 
    // unzip it
in 
    Binary.Decompress(Raw(Source)[Data], Compression.Deflate)

 

Name it  unzip and and then call it like this

 

let
    Source = unzip("C:\downloads\doesnotwork.zip"),
    #"Imported CSV" = Csv.Document(Source,[Delimiter=",",  Encoding=1252, QuoteStyle=QuoteStyle.None])
in
    #"Imported CSV"
Anonymous
Not applicable

@lbendlin firstly thanks for your efforts & excellent analysis, I really appreacite it.

 

I have tried your code, but at the moment it is only returning the first column from the zip file:

Capture.JPG

Remove and reapply the csv import step.

Anonymous
Not applicable

Sorry, but I don't understand??

the .Buffer functions load the data once into memory and avoid reloading it over and over again (as is done in the original code)

Anonymous
Not applicable

that part I understand.

 

What I don't understand is what you meant by "Remove and reapply the csv import step" or how to get all the columns from the zip

I meant you should remove the CSV step from my last instructions. Try this instead. It will result in a CSV icon that you can then double click.

 

let
    Source = unzip("C:\downloads\doesnotwork.zip")
in
    Source
Anonymous
Not applicable

@lbendlin sorry, I'm back on this topic again 🙄

 

How would you modify your code to handle mutilple CSV files inside a ZIP?
Currently I always get the first CSV.

 

Thank you

@lbendlin thanks for providing a solution to this problem. I'm running into a problem and I can't figure out what's going on. When I try to invoke the custom Unzip function, I receive the following error:

 

 

 

An error occurred in the " query. Expression.Error: We cannot convert a value of type Binary to type Text.
Details:
     Value=[Binary]
     Type=[Type]

 

 

 

Any ideas? I should note that I tried Mark White's solution first (http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html) but I ran into issues with that solution as well.

 

Thanks in advance for your help!

You may want to share some more details.  I also wrote a blog entry on this very topic if you are interested.

Thanks @lbendlin, I'd love to see your blog. Can you please share the link?

 

I tried using Mark White's solution to access this zip file but I receive the following error: 

Rob_B_0-1600900610934.png

His custom function works for other zip files on the same website (using the Web.Contents function as the source) however it doesn't work on the zip file linked above. My guess is that the file size is too large (1.52 GB compressed, >10 GB uncompressed). The .csv in the .zip folder is so large that Zip64 is used. Perhaps that's why Mark's solution doesn't work.

 

I tried your function but when I invoke it, Power BI gives me the following error message:

 

Rob_B_1-1600900997000.png

 

I'm not familiar enough with M code to find out what's going wrong in your function. I should note that I receive the same error for all zip files on the website linked above, including the smaller zip files not using Zip64.

 

Thanks!

 

Rob

 

Here's the blog

https://community.powerbi.com/t5/Community-Blog/Working-With-Zip-Files-in-Power-Query/ba-p/1190186

 

but your file sizes may be the actual issue. I only tested with files up to 600 MB.

Thanks @lbendlin, your blog post is really helpful!

 

Next step is to figure out how to modify the function to accomodate larger file sizes using Zip64 😀

Anonymous
Not applicable

Ah ok, apologies.

Confirm all is working as expected.

Thanks again for your time & effort on this 👍

artemus
Employee
Employee

@artemus thank you for providing the complete structural solution. I have added a sprinkle of Binary.Buffer to it to make it a bit faster.

Anonymous
Not applicable

@artemus I hadn't seen that post, but you are right, it does work! 🙂

Thanks for your help

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors