cancel
Showing results for 
Search instead for 
Did you mean: 
ibarrau

[PowerQuery] Decompress .zip Files

The first thing we need to know before writing the code is that zip extensions are more complex than gzip. The extension has different compression methods which will make it difficult to read from Power BI. Even though the key will be a Power Query function that I have found many years ago that works for 90% of the cases, I don't think it will always work. It hasn’t fail for me yet.

In order to read a file, we will add a custom function, because Power BI doesn't have a decompress zip function by default. We have to create it.

Let's check my GitHub to get the code.

https://github.com/ibarrau/PowerBi-code/blob/master/PowerQuery/ExtractZIP.pq

Then, inside Power BI Desktop, get data from a blank query to copy the code we have just seen in GitHub in order to create the function.

copying zippqfunctioncopying zippqfunction

As you can see, my custom function name will be DecompressZip(). Now that we have the custom function, we need to execute it for a file. We can use the same example of a gzip file with a small difference in the steps, running a function that contains a file as a parameter but without specifying a compression method.

The first step is reading the file. We can get data from a csv and pick up "all files" in the drop-down extension list or modify the code manually from blank query, if you prefer writing it. In any case, it should look like this:

 

= File.Contents(“C:\Datasets\Energy Data.zip”)

ibarrau_1-1598362842060.png

 

Second Step, add a custom step that will decompress the zip file from the previous step.

= DecompressZip(Origen)


finaldecompresszipfinaldecompresszip

This way, we obtain a similar result like connecting a folder in filesystem. It allows us to explore the content to combine files as one table or to just take one of them.

The next action you choose for Power Query is your own. Now you know how to read a .zip file. I hope it doesn't fail for you and you find it useful.

 

Original Spanish Post

Comments
Anonymous

How do I read .csv files?

@Anonymous to read the csv files at the last picture you can just click on the two down arrows next to the column content.

If you are talking about csv in general you can read it in the doc https://docs.microsoft.com/en-us/power-bi/connect-data/desktop-connect-csv

Regards,

Anonymous

I tried it, but when I click it gets wrong.

 

I need to download this .zip file and access the .csv file data within it

Link: https://balanca.economia.gov.br/balanca/bd/comexstat-bd/ncm/EXP_COMPLETA.zip

 

EXP_COMPLETA.zip -> EXP_COMPLETA.csv

I had the same issue. I guess you got a 0-byte file? There are other sollutions on the internet that can do this in M, but none of the worked for me... :S. After some reading into the cause, I found that my zip files had no file lenght in the "local header".

 

I worked my way arround this issue by grabbing the "End of central directory record" and used its info to grab the file list from the central directory (using the same method as @ibarrau ). That gave me enough information to pinpoint the range in the binary for each file in the Zip. I build that into a M function. You can find it here:

 

(https://github.com/Michael19842/PowerBiFunctions/blob/main/ZipFile/Unzip.m) just create a function and call it with the binary of your file.

 

 

 

 

// Function build for decompressing a Zip file even if the filelength is missing in the localheader
// More info on zip files here: https://en.wikipedia.org/wiki/ZIP_(file_format)

// Known limitation: If there is a comment appended to the central header then this function will fail. You can use a hex editor to find these comments at the end of the file wicht will be just readable text  

(ZipFile as binary) => let
    //Load the file into a buffer
    ZipFileBuffer = Binary.Buffer(ZipFile),
    ZipFileSize = Binary.Length(ZipFileBuffer),

    //Constant values used in the query
    CentralHeaderSignature = 0x02014b50,
    CentralHeaderSize = 42,
    LocalHeaderSize = 30,

    // Predefined byteformats that are used many times over 
    Unsigned16BitLittleIEndian = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
    Unsigned32BitLittleIEndian = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),

    // Definition of central directory header
    CentralDirectoryHeader  = BinaryFormat.Record([
        Version                = Unsigned16BitLittleIEndian,
        VersionNeeded          = Unsigned16BitLittleIEndian,
        GeneralPurposeFlag     = Unsigned16BitLittleIEndian,
        CompressionMethod      = Unsigned16BitLittleIEndian,
        LastModifiedTime       = Unsigned16BitLittleIEndian,
        LastModifiedDate       = Unsigned16BitLittleIEndian,
        CRC32                  = Unsigned32BitLittleIEndian,
        CompressedSize         = Unsigned32BitLittleIEndian,
        UncompressedSize       = Unsigned32BitLittleIEndian,
        FileNameLength         = Unsigned16BitLittleIEndian,
        ExtrasLength           = Unsigned16BitLittleIEndian,
        FileCommentLenght      = Unsigned16BitLittleIEndian,
        DiskNumberStarts       = Unsigned16BitLittleIEndian,
        InternalFileAttributes = Unsigned16BitLittleIEndian,
        EnternalFileAttributes = Unsigned32BitLittleIEndian,
        LocalHeaderOffset      = Unsigned32BitLittleIEndian
    ]),

    // Definition of the end of central directory record
    EndOfCentralDirectoryRecord  = BinaryFormat.Record([
        RestOfFile                      = BinaryFormat.Binary(ZipFileSize-22), 
        EOCDsignature                   = Unsigned32BitLittleIEndian,
        NumberOfThisDisk                = Unsigned16BitLittleIEndian,
        DiskWhereCentralDirectoryStarts = Unsigned16BitLittleIEndian,
        NumberOfRecordsOnThisDisk       = Unsigned16BitLittleIEndian,
        TotalNumberOfRecords            = Unsigned16BitLittleIEndian,
        CentralDirectorySize            = Unsigned32BitLittleIEndian,
        OffsetToStart                   = Unsigned32BitLittleIEndian
    ]),

    //Formatter used for building a table of all files in te central directory
    CentralHeaderFormatter = BinaryFormat.Choice(
        Unsigned32BitLittleIEndian, // Should contain the signature
        each if _ <> CentralHeaderSignature // Test if the signature is not there
            then BinaryFormat.Record([
                        LocalHeaderOffset = null,
                        CompressedSize = null,
                        FileNameLength = null,
                        HeaderSize = null,
                        IsValid  = false,
                        Filename = null
                ]) // if so create a dummy entry 
            else BinaryFormat.Choice(
                    //Catch the staticly sized part of the central header
                    BinaryFormat.Binary(CentralHeaderSize),   
                    //Create a record containing the files size, offset(of the local header), name, etc.. 
                    each BinaryFormat.Record([ 
                        LocalHeaderOffset = CentralDirectoryHeader(_)[LocalHeaderOffset],
                        CompressedSize = CentralDirectoryHeader(_)[CompressedSize],
                        FileNameLength = CentralDirectoryHeader(_)[FileNameLength],
                        HeaderSize = LocalHeaderSize + 
                                    CentralDirectoryHeader(_)[FileNameLength] + 
                                    CentralDirectoryHeader(_)[ExtrasLength],
                        IsValid  = true,
                        Filename = BinaryFormat.Text(CentralDirectoryHeader(_)[FileNameLength])
                        ]),
                    type binary 
                )
    ),

    //Get a record of the end of central directory, this contains the offset of the central header so we can itterate from that position
    EOCDR = EndOfCentralDirectoryRecord(ZipFileBuffer),

    //Get the central directory as a binary extract
    CentralDirectory = Binary.Range(ZipFileBuffer,EOCDR[OffsetToStart]),

    //A list formatter for the central directory  
    CentralDirectoryFormatter = BinaryFormat.List(CentralHeaderFormatter, each _[IsValid] = true),

    //Get a Table from Records containing the file info extracted from the central directory
    FilesTable  = Table.FromRecords(List.RemoveLastN(CentralDirectoryFormatter(CentralDirectory),1)),
    
    //Add the binary to the table and decompress it
    ReturnValue = Table.AddColumn(FilesTable,"Content",each Binary.Decompress(Binary.Range(ZipFileBuffer,[LocalHeaderOffset] + [HeaderSize],[CompressedSize]),Compression.Deflate))
in
    ReturnValue

 

 

 

 

and if you want to give yourself a challenge 😛

https://en.wikipedia.org/wiki/ZIP_(file_format)  and a Hex Editor got me a long way

I am getting an Error with the below code: DataFormat.Error: Block length does not match with its complement. Also earlier got a DataFormat Error: Found invalid data while decoding. Other than the files being zipped with no parent folder, unsure what I am doing is wrong: (ZipFile as binary) => let //Load the file into a buffer ZipFileBuffer = Binary.Buffer(ZipFile), ZipFileSize = Binary.Length(ZipFileBuffer), //Constant values used in the query CentralHeaderSignature = 0x02014b50, CentralHeaderSize = 42, LocalHeaderSize = 30, // Predefined byteformats that are used many times over Unsigned16BitLittleIEndian = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian), Unsigned32BitLittleIEndian = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian), // Definition of central directory header CentralDirectoryHeader = BinaryFormat.Record([ Version = Unsigned16BitLittleIEndian, VersionNeeded = Unsigned16BitLittleIEndian, GeneralPurposeFlag = Unsigned16BitLittleIEndian, CompressionMethod = Unsigned16BitLittleIEndian, LastModifiedTime = Unsigned16BitLittleIEndian, LastModifiedDate = Unsigned16BitLittleIEndian, CRC32 = Unsigned32BitLittleIEndian, CompressedSize = Unsigned32BitLittleIEndian, UncompressedSize = Unsigned32BitLittleIEndian, FileNameLength = Unsigned16BitLittleIEndian, ExtrasLength = Unsigned16BitLittleIEndian, FileCommentLenght = Unsigned16BitLittleIEndian, DiskNumberStarts = Unsigned16BitLittleIEndian, InternalFileAttributes = Unsigned16BitLittleIEndian, EnternalFileAttributes = Unsigned32BitLittleIEndian, LocalHeaderOffset = Unsigned32BitLittleIEndian ]), // Definition of the end of central directory record EndOfCentralDirectoryRecord = BinaryFormat.Record([ RestOfFile = BinaryFormat.Binary(ZipFileSize-22), EOCDsignature = Unsigned32BitLittleIEndian, NumberOfThisDisk = Unsigned16BitLittleIEndian, DiskWhereCentralDirectoryStarts = Unsigned16BitLittleIEndian, NumberOfRecordsOnThisDisk = Unsigned16BitLittleIEndian, TotalNumberOfRecords = Unsigned16BitLittleIEndian, CentralDirectorySize = Unsigned32BitLittleIEndian, OffsetToStart = Unsigned32BitLittleIEndian ]), //Formatter used for building a table of all files in te central directory CentralHeaderFormatter = BinaryFormat.Choice( Unsigned32BitLittleIEndian, // Should contain the signature each if _ <> CentralHeaderSignature // Test if the signature is not there then BinaryFormat.Record([ LocalHeaderOffset = null, CompressedSize = null, FileNameLength = null, HeaderSize = null, IsValid = false, Filename = null ]) // if so create a dummy entry else BinaryFormat.Choice( //Catch the staticly sized part of the central header BinaryFormat.Binary(CentralHeaderSize), //Create a record containing the files size, offset(of the local header), name, etc.. each BinaryFormat.Record([ LocalHeaderOffset = CentralDirectoryHeader(_)[LocalHeaderOffset], CompressedSize = CentralDirectoryHeader(_)[CompressedSize], FileNameLength = CentralDirectoryHeader(_)[FileNameLength], HeaderSize = LocalHeaderSize + CentralDirectoryHeader(_)[FileNameLength] + CentralDirectoryHeader(_)[ExtrasLength], IsValid = true, Filename = BinaryFormat.Text(CentralDirectoryHeader(_)[FileNameLength]) ]), type binary ) ), //Get a record of the end of central directory, this contains the offset of the central header so we can itterate from that position EOCDR = EndOfCentralDirectoryRecord(ZipFileBuffer), //Get the central directory as a binary extract CentralDirectory = Binary.Range(ZipFileBuffer,EOCDR[OffsetToStart]), //A list formatter for the central directory CentralDirectoryFormatter = BinaryFormat.List(CentralHeaderFormatter, each _[IsValid] = true), //Get a Table from Records containing the file info extracted from the central directory FilesTable = Table.FromRecords(List.RemoveLastN(CentralDirectoryFormatter(CentralDirectory),1)), //Add the binary to the table and decompress it ReturnValue = Table.AddColumn(FilesTable,"Content",each Binary.Decompress(Binary.Range(ZipFileBuffer,[LocalHeaderOffset] + [HeaderSize],[CompressedSize]),Compression.Deflate)) in ReturnValue

@rosepetal140 I'm not sure what the issue is with your files. Both sollutions posted here work in a different way and depending on the Zip file, one might work while the other one failes.

 

Can you pershaps share your zip file? 

Yes! Please find the files attached: https://www.dropbox.com/sh/onuk5ev1mm839xr/AACUuQqp0UpWNbnpBTF95ZEUa?dl=0 .  If you cannot access please let me know if there is another way I can share.  Its a zip with two files: measure and program cost.  when I do one, I only see the program cost.  when I user the other solution I see both but get an error.  Thank you so much!

@rosepetal140 : Ok, I found the issue. It seems the compression type used in your zip file is different. You should use "Compression.None" instead of "Compression.Deflate". You can find the used compression using a HEX editor like https://hexed.it/.  

 

See below screen of your file:

Michael19842_0-1642436705081.png

 

There you see that bytes 9 and 10 are both 00. From de definition on wiki (https://en.wikipedia.org/wiki/ZIP_(file_format)) we can find out that those bits tell the compression type used

 

Michael19842_1-1642436910213.png

Please let me know if that wordked for you, or if you need any more assistance. Please also remove the zip file from dropbox. Good luck 😄

So Im a total noob and trying to understand all of this lol, but thank you it worked!!!!!!!!!!!!!!!!!!!  I cant thank you enough you are my HERO!!!!!! Engineering would have yelled at me for weeks about this (hard enough being a female in their world) and given up on me so YAY!!!! To be honest I dont know how you figured it out.   ido see the 00 on the hex but how you determined based on wiki is a mystery - i definitely dont want to waste your time at all so dont feel obligated to break it down to me.  But thank you!!!!!

@rosepetal140  NP, glad to read that you got it working!

 

On a more personal note... This stuff can be hard, and it took me some time figure out how to resolve the issue I had with my zip files as well. The sollution @ibarrau posted here, greatly assisted in debugging and finding a sollution for my Zipfile issue.  

 

Don't think to little of yourself. You figured out how get this issue you had resolved, and you can take pride in that! 

 

"One of the biggest defects in life is the inability to ask for help." ~ Robert Kiyosaki

Hey I have a somewhat similar problem, do you know how to convert .ods files to csv or xls?

Hey Christian, 

 

.ods files are basically a zip file containing content, style and metadata in (mostly) xml format. You can use this function to get to these files. Reading content.xml should give you a way to get to your data. However.... 🤔 ods is easily converted into xlsx using Microsoft Excel, which  is way  (and then some) easier. Do remember to post your results if you choose to take the long way 😋

 

Good luck! and have fun 

 

P.S. I'm always happy to help, but i dont have this script written yet