Working With Zip Files in Power Query

lbendlin · ‎06-29-2020

Using ZIP files directly in Power BI is an attractive idea. In your data source you can store your raw data efficiently – oftentimes the CSV files inside the ZIP are very suitable for compression – and when accessing the ZIP rather than the CSV you can reduce the network traffic as well (let’s ignore the scenario where that uses compression too…) and improve the dataset refresh times.

Here’s where the ZIP ecosystem throws a wrench into your plans. ZIP files come in all kinds of formats, and while most adhere to the basic standards of how a ZIP file should be structured, many legacy systems produce ZIP files that are decidedly non-standard.

For this article let’s assume that we are working with a very simple scenario – one ZIP file that only contains one compressed file. (For a full solution that will work with multiple compressed files refer to https://community.powerbi.com/t5/Power-Query/How-to-connect-Azure-DevOps-REST-API-in-to-power-bi/m-p... by artemus)

(A quick excursion into the standard ZIP structure: https://en.wikipedia.org/wiki/Zip_(file_format)#Structure)

The compressed files are stored in the ZIP one after the other. Each file has its own local header (the File Entry Header), and at the end of the ZIP file a Central Directory repeats the file header information.

Both headers are supposed to contain information about the size of the compressed file. This is where the problem comes in with some of the ZIP files. They ignore the local header (the file sizes are recorded as zero) and only write to the central directory.

This means that Power Query code that reads the ZIP file from the beginning (the majority of code examples do it that way) will be unable to process ZIP files that do not have the proper local header information.

Let’s explore the alternative option – reading the ZIP file from the end and using the central directory information.

Admittedly the following approach assumes that no comments are entered at the very end of the ZIP file. Hopefully that is true for your ZIP files...

Here is a ZIP file where the local header is missing the file size information:

Annotation 2020-06-29 154645.png

But the central directory has it:

Annotation 2020-06-29 154855.png

In order to get to that file size information, we have to do some pointer math. First, we have to load the entire ZIP file into memory. Since that is a costly operation, we better make sure we only do it once. Let's stuff the data into a binary buffer:

(ZIPFile) =>
let
    Source = Binary.Buffer(File.Contents(ZIPFile)),
...

Then we need to get the start location of the central directory. Again, assuming there are no comments in the Central directory, the start location is stored in a pointer that is located in the bytes 6 to 3 from the end of the ZIP file. To get these we create record that basically skips the entire buffer except for the last six bytes, and then reads the first four bytes of these last six bytes.

...
    Size = Binary.Length(Source),
    Directory = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Size-6), 
                    Start=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian)
            ]) ,
    Start = Directory(Source)[Start],
...

Next up - we need to get the size of the compressed file in the archive, and we need to know where that data starts - there is still the file name and the "extra data" content that we need to skip.

...
    FirstDirectoryEntry = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Start+20), 
                    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                    UnCompressedFileSize=BinaryFormat.Binary(4),
                    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
...

Note that the MiscHeader in the central directory is 20 bytes whereas in the local header it is 18 bytes. We don't really care about the UncompressedFileSize, so we just skip its four bytes in the record definition.

Now that we have all required data, we can calculate the offset and the data range for our compressed file. Adding one byte at the end to close it out neatly.

...
    Offset = 30+FirstDirectoryEntry(Source)[FileNameLen]+FirstDirectoryEntry(Source)[ExtrasLen],
    Compressed = FirstDirectoryEntry(Source)[FileSize]+1,
    Raw = BinaryFormat.Record([
                    Header=BinaryFormat.Binary(Offset), 
                    Data=BinaryFormat.Binary(Compressed)
            ]) 
...

That's pretty much it. All that's left to do now is to decompress the data and return it to whoever called our function.

...
in 
    Binary.Decompress(Raw(Source)[Data], Compression.Deflate)

Note that throughout the process we have used the binary buffer to great effect. Even though we had to read the entire ZIP file into memory we only had to do it once.

As mentioned in the beginning this approach will only work for the first compressed file in a ZIP archive that doesn't use Central Directory comments. If your ZIP files are more elaborate, then please have a look at the implementation by artemus.

For reference here is the complete code for the Power Query function:

// expects full path to the ZIP file, only extracts the first data file after getting its size from the central directory
// https://en.wikipedia.org/wiki/Zip_(file_format)#Structure
(ZIPFile) =>
let
    //read the entire ZIP file into memory - we'll use it often so this is worth it
    Source = Binary.Buffer(File.Contents(ZIPFile)),
    // get the full size of the ZIP file
    Size = Binary.Length(Source),
    //Find the start of the central directory at the sixth to last byte
    Directory = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Size-6), 
                    Start=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian)
            ]) ,
    Start = Directory(Source)[Start],
    //find the first entry in the directory and get the compressed file size
    FirstDirectoryEntry = BinaryFormat.Record([ 
                    MiscHeader=BinaryFormat.Binary(Start+20), 
                    FileSize=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                    UnCompressedFileSize=BinaryFormat.Binary(4),
                    FileNameLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                    ExtrasLen=BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]) ,
    //figure out where the raw data starts            
    Offset = 30+FirstDirectoryEntry(Source)[FileNameLen]+FirstDirectoryEntry(Source)[ExtrasLen],
    Compressed = FirstDirectoryEntry(Source)[FileSize]+1,
    //get the raw data of the compressed file
    Raw = BinaryFormat.Record([
                    Header=BinaryFormat.Binary(Offset), 
                    Data=BinaryFormat.Binary(Compressed)
            ]) 
    // unzip it
in 
    Binary.Decompress(Raw(Source)[Data], Compression.Deflate)

Working With Zip Files in Power Query

Unveiling the Power of Lakehouse's SQL Analytics E...

How to provide sample data in the Power BI Forum

Re: Power BI Tip: How To Sort A Column With Duplic...

Important Links every Power BI enthusiastic shoul...

Enhanced Data Profiling in Power Query: GUI and Ta...