PBI Dataflow - connect to a Parquet dataset on Azure Blob
I have some parquet datasets saved on Azure Blob. I am currently creating dataflows by using something similar to this:
//then I filter the blobs based on the path; I generally partition the data by date, e.g. root/date=2021-01-12/guid.parquet
//I split the date out of the path and create a timestamp, which I also use for incremental refresh (just to limit the number of files I am dealing with)
//to read the data, I use Parquet.Document()
Table.AddColumn(filter_files, "data", each Parquet.Document([Content]))
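Roughly, the date-parsing step I describe above looks like this (the "blobs" table, the "PartitionDate" column name, and the exact date format are just illustrative):

//hypothetical sketch: pull "2021-01-12" out of a blob Name such as
//"date=2021-01-12/guid.parquet" and turn it into a datetime that can be
//compared against the RangeStart/RangeEnd incremental refresh parameters
Table.AddColumn(blobs, "PartitionDate",
    each DateTime.From(Date.FromText(Text.BetweenDelimiters([Name], "date=", "/"))),
    type datetime)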
Is there a more direct way to connect to data stored in this manner? It seems wasteful to have PBI guess the data type of each column when the dataset already has a schema (and potentially row-group-level statistics), and since Parquet is columnar storage we should ideally be able to avoid reading certain columns entirely.
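For illustration, the closest thing I can see is dropping columns right after the read, which as far as I can tell still downloads and parses the whole file (the column names here are placeholders):

//"id" and "value" are placeholder column names; this trims the table in M,
//but it does not appear to prune the columns actually read from the blob
Table.AddColumn(filter_files, "data",
    each Table.SelectColumns(Parquet.Document([Content]), {"id", "value"}))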
I am able to query these datasets using pyarrow, dask, PySpark, and Apache Drill. Thanks to partition-level and then row-level filtering, I can retrieve results from 100 million rows in just a couple of seconds. Perhaps I am missing something with PBI? Is there a more direct way to take advantage of the Parquet file format and "parquet datasets"?
According to your description, you are able to retrieve results from 100 million rows in just a couple of seconds thanks to partition-level and then row-level filtering. That is a very efficient query speed, which I think is enough for daily query operations. Are you still looking for a better way to connect Power BI to a Parquet dataset?
I’m not very familiar with Parquet datasets on Azure Blob; maybe you can @ some of the top solution authors for help, or open a case in the Power BI Developer forum for help with solutions using the REST API.
Community Support Team _Robert Qin
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
Sorry - I meant that outside of PBI I am able to do that against the same source PBI is connected to. It takes <10 seconds to open the dataset (nothing is in memory at that point), after which I can filter the data (ideally by partition first to eliminate I/O, and then at the row level if required).
With PBI, I need to do the following (a rough end-to-end sketch follows the list):
1) List the blob containers
2) Filter the folder name to the dataset I am interested in reading
3) Parse the folder names to eventually read the actual parquet files (there can be many thousands of individual files)
4) (I personally filter these files with RangeStart and RangeEnd parameters from incremental refreshes)
5) Read each file with Parquet.Document() - there does not seem to be a way to select columns at this step either, which is odd since Parquet is a columnar format, so reading all columns defeats a major benefit. The documentation is limited here: https://docs.microsoft.com/en-us/powerquery-m/parquet-document
6) Expand the content
7) Fix the data types for each column (parquet datasets have a schema, so ideally PBI would not need to guess the data types at all)
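Put together, the whole dataflow query ends up looking roughly like this (the storage account, container, dataset folder, date format, and column names are all placeholders, and RangeStart/RangeEnd are the usual incremental refresh parameters):

let
    //1) list the containers in the storage account and pick one
    source = AzureStorage.Blobs("https://mystorageaccount.blob.core.windows.net"),
    container = source{[Name = "root"]}[Data],

    //2-3) keep only this dataset's parquet files and parse the hive-style
    //     folder name, e.g. "date=2021-01-12/guid.parquet"
    dataset_files = Table.SelectRows(container,
        each Text.StartsWith([Name], "mydataset/") and Text.EndsWith([Name], ".parquet")),
    with_date = Table.AddColumn(dataset_files, "PartitionDate",
        each DateTime.From(Date.FromText(Text.BetweenDelimiters([Name], "date=", "/"))),
        type datetime),

    //4) limit the files to the incremental refresh window
    filter_files = Table.SelectRows(with_date,
        each [PartitionDate] >= RangeStart and [PartitionDate] < RangeEnd),

    //5) read every remaining file - all columns are read
    with_data = Table.AddColumn(filter_files, "data", each Parquet.Document([Content])),

    //6) expand the nested tables ("id" and "value" are placeholder columns)
    expanded = Table.ExpandTableColumn(with_data, "data", {"id", "value"}),

    //7) re-assert the types that the Parquet schema already knows
    typed = Table.TransformColumnTypes(expanded, {{"id", Int64.Type}, {"value", type number}})
in
    typed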
It just does not seem to use any of the advantages of the Parquet format. Am I doing it incorrectly? Is Parquet.Document() the best way to read data from Azure Blob into a dataflow?
I do see that in March 2021, there will be an option to save the backend dataflow data in Parquet instead of the default CSV, so perhaps there will be more support coming soon: