Reading large Azure data lake parquet files in Power BI

a_pereira · ‎08-10-2022

Hi everyone,

I would like to build a dashboard in Power BI using a Parquet file stored in Azure Data Lake blob storage.

It contains 4 columns (Date, ID, Product Price, Number of Stores), and the dashboard would later be filtered by ID.

I thought I had found the answer by using the Azure Data Lake Storage Gen2 "Get Data" connector and entering the Parquet file's URL in the data lake, as below:

https://<accountname>.dfs.core.windows.net/<container>/<subfolder>

I get access to the correct file and I can combine and load the data.

Unfortunately, only Import mode is available, and the file is far too large (~70 GB) to import into Power BI Desktop.

I then found some documentation stating that it is possible to use Dataflows with DirectQuery to avoid importing the data and still build reports by querying Azure Data Lake Storage directly.

I found out how to connect Azure to Power BI and build Dataflows, but I can't manage to reach the Parquet file I need.

I can't find a way to create the Common Data Model (CDM) folder that the Dataflow needs in order to get the data directly from ADLS. (I did find how to create a dataflow from an existing CDM folder.)

Could you please help with creating a CDM folder in Azure Data Lake Storage?

I also saw in the documentation that only CSV files are permitted in a CDM folder. Is that so? Can I not use my Parquet file as a data source?

If this is not the correct approach, could you please suggest a solution for querying or importing large datasets (~70 GB) from Azure Data Lake Storage into Power BI Desktop without modifying the original dataset?

Thank you for your help,

AP

bcdobbs · ‎08-10-2022

I'd agree with @TomMartens

As an alternative to a Spark SQL pool, you could give serverless SQL in Synapse a go. You can create a view over your existing data lake Parquet files and then use DirectQuery against that view.

Have a look at https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
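
To make the idea concrete, here is a minimal sketch of what such a view definition could look like. It is a small Python helper that only builds the T-SQL text for a view reading Parquet through `OPENROWSET`. The view name, account, container, and path are hypothetical placeholders, not values from this thread, and the statement still has to be run in a database on the serverless SQL endpoint with suitable access to the storage account.

```python
def build_parquet_view_sql(view_name: str, account: str, container: str, path: str) -> str:
    """Return a T-SQL CREATE VIEW statement that reads Parquet files via OPENROWSET.

    All arguments are placeholders chosen by the caller; nothing here
    connects to Azure or executes the statement.
    """
    # Same URL shape as the ADLS Gen2 endpoint used with "Get Data".
    url = f"https://{account}.dfs.core.windows.net/{container}/{path.lstrip('/')}"
    # Double any single quotes so the URL stays inside the T-SQL string literal.
    url_literal = url.replace("'", "''")
    return (
        f"CREATE VIEW {view_name} AS\n"
        "SELECT *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url_literal}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [rows];"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_parquet_view_sql(
        "dbo.ProductPrices", "myaccount", "mycontainer", "prices/*.parquet"
    ))
```

Power BI can then connect to the serverless SQL endpoint as a SQL source and use the view in DirectQuery mode, so a filter on ID is sent to Synapse as a query instead of the 70 GB file being imported.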

Ben Dobbs

Azure_newbie · ‎08-11-2023

Sorry to ask a follow-up question in an old thread. May I know what the licence requirement is for setting up serverless SQL in Synapse? Fabric was released recently, and it seems Synapse now sits under Fabric's umbrella. Also, do we need a SQL Server licence in this case?

R1k91 · ‎08-11-2023

Fabric is in preview and you should not use it in prod.
Synapse is a separate product. To use the serverless pool you just need to create a Synapse workspace in Azure. It'll have a default serverless SQL pool that you can use. It's pay per use: you pay per query and for data movement.
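
As a rough illustration of what pay per use means for a dataset this size, here is a back-of-the-envelope sketch. It assumes the serverless pool bills by the amount of data each query processes. The price per TB is a hypothetical placeholder (check the pricing page for your region), and treating 1 TB as 10^12 bytes is also an assumption.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def estimate_query_cost(bytes_processed: int, price_per_tb: float) -> float:
    """Estimate the cost of one query from the bytes it processes.

    price_per_tb is supplied by the caller; it is not a real quoted price.
    """
    if bytes_processed < 0 or price_per_tb < 0:
        raise ValueError("bytes_processed and price_per_tb must be non-negative")
    return bytes_processed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    full_scan = 70 * 10**9   # a query that reads the whole ~70 GB file
    hypothetical_price = 5.0  # placeholder price per TB, not a real quote
    print(f"{estimate_query_cost(full_scan, hypothetical_price):.2f}")
```

Because Parquet is columnar, a query that touches only some columns or a narrow ID filter may process much less than the full file, so a full scan is best read as an upper bound per query.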

Azure_newbie · ‎08-11-2023

Is the link below the right one to look at?
Pricing - Azure Synapse Analytics | Microsoft Azure

R1k91 · ‎08-15-2023

Yes, it is.

Serverless is listed under data warehousing workloads.

Dedicated is under the same category, but it is optional.

a_pereira · ‎08-10-2022

Up

TomMartens · ‎08-10-2022

Hey @a_pereira ,

When you start creating a new dataflow, select "Define new tables" instead of "Attach a Common Data Model folder".

I assume this will help, but please be aware that there might be data size limits as well. DirectQuery to a dataflow does not mean the Parquet files will be queried in DQ mode; it is the dataflow's stored data that gets queried.

If you have a Spark/Databricks Cluster, I would recommend using DQ against a Spark SQL query.

Hopefully, this provides some ideas on how to tackle your challenge.

Regards,
Tom