scabral
Helper IV

Suggestions for iterating over files in lakehouse

So this is a two-part question.

 

I've created a pipeline that calls the Power BI REST API Get Activity Events endpoint for one or more days (driven by a parameter) and stores each day's data in a file in a lakehouse, with the date appended to the filename. This part is working.

 

My first question: there are different file formats to choose from in the Copy activity when creating the destination file (JSON, Delimited, Parquet, Binary, Avro, ORC). Right now I chose Delimited, and the pipeline creates a .txt file for each day that I pass in the parameter and stores them in the Files folder in the lakehouse. What I eventually want to do in the pipeline is read the data from these files, most likely just 12-15 columns which will always be the same each time, and insert the rows into a table in the lakehouse. What is the best file format to choose for this process?

 

My second question: what is the best way to iterate over all the files in the lakehouse, select the columns I need, and insert them into the lakehouse table? My guess is that some formats handle the select query better than others. Which activities should I use to loop through each file and move the needed columns and rows into the lakehouse table?

 

thanks

Scott


5 REPLIES
v-gchenna-msft
Community Support

Hi @scabral ,

Thanks for using Fabric Community.
Choosing the most suitable approach depends on your specific needs and the type of data being handled. Here are some suggestions based on my understanding of your scenario:

File Format Selection:

For your scenario, where you'll be reading specific columns from the data and inserting them into a lakehouse table, columnar file formats are the clear choice. Here's a breakdown of the most suitable option:

  • Parquet: This is a widely adopted, efficient format that excels in columnar storage and fast querying. It supports schema evolution, making it adaptable to future data changes. It's a great default choice for most use cases.

Advantages of Columnar Formats:

  • Faster Query Performance: Since you'll only be reading specific columns, columnar formats significantly improve query speed by skipping irrelevant data during processing.
  • Efficient Storage: By storing data in columns instead of rows, they reduce storage footprint and optimize data transfer.

Delimited Text (CSV) Considerations: 

While delimited text (CSV) might seem familiar, it's not ideal due to:

  • Slower Processing: Row-based structure requires reading the entire row even if you only need a few columns.
  • Larger File Size: Less efficient compression compared to columnar formats.
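
To make the column-pruning point concrete, here is a minimal sketch, assuming a Fabric notebook attached to the lakehouse (spark is the session the notebook provides; the file path and column names are hypothetical placeholders):

# Minimal sketch: only the selected columns are read from the Parquet file,
# not the full rows. The path and column names are hypothetical placeholders.
df = spark.read.parquet("Files/activity_events/activityevents_2024-04-01.parquet")
df.printSchema()                                   # the schema travels with the file
df.select("Id", "CreationTime", "UserId").show(5)  # only these columns are scanned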

 

Iteration and Data Processing:

Here's a recommended approach for iterating through files and inserting data into your lakehouse table:

  • A Get Metadata activity can retrieve the list of files from the lakehouse Files folder.
  • A ForEach activity is ideal for iterating over that file list.
  • Inside the loop, a Copy activity (or a Fabric notebook called from the pipeline) can read each file, select the needed columns, and insert the rows into the lakehouse table; a notebook sketch follows this list.
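
If you take the notebook route for the read-and-insert step, here is a minimal sketch, again assuming a Fabric notebook attached to the lakehouse (the folder, table, and column names are hypothetical placeholders):

# Minimal sketch: read all of the daily Parquet files in one folder, keep only
# the needed columns, and append the rows to a lakehouse Delta table.
# Folder, table, and column names are hypothetical placeholders.
files_path = "Files/activity_events"
needed_columns = ["Id", "CreationTime", "Operation", "UserId", "Activity"]

df = (
    spark.read.parquet(files_path)   # reads every Parquet file in the folder
         .select(*needed_columns)    # column pruning: only these are scanned
)

df.write.format("delta").mode("append").saveAsTable("activity_events")

The pipeline equivalent is a Get Metadata activity (field list set to Child items) feeding a ForEach activity, with a Copy activity or Notebook activity inside the loop.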


By following these recommendations, you'll establish a more efficient and scalable pipeline for processing your Power BI activity event data in your lakehouse.

Hope this is helpful. Please let me know if you have further queries.

Hello @scabral ,

We haven't heard back from you on the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it may help others.
Otherwise, we will follow up with more details and try to help.

Hi @scabral ,

We haven't heard back from you on the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it may help others.
Otherwise, we will follow up with more details and try to help.

I created a process in Fabric Data Factory based on some of the points listed above.

 

I created a new data pipeline with the following activities:

1. A ForEach activity that calls the Power BI REST API Get Activity Events endpoint for each day (parameter driven) and stores each day's data in a Parquet file in a new lakehouse.

2. A Get Metadata activity that gets the list of file names from the lakehouse.

3. A ForEach activity that uses a Copy activity to load each Parquet file into a lakehouse table and then archives each file into an archive folder.

 

I still need to make some minor tweaks, but the overall pipeline is working.
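
For anyone who prefers a notebook over the Copy activity for steps 2 and 3, a minimal sketch could look like this (run in a Fabric notebook attached to the lakehouse; the folder, table, column, and archive names are hypothetical placeholders):

# Minimal sketch: list the daily Parquet files, append the needed columns to a
# lakehouse Delta table, then move each processed file into an archive folder.
# Folder, table, column, and archive names are hypothetical placeholders.
from notebookutils import mssparkutils

source_dir = "Files/activity_events"
archive_dir = "Files/activity_events_archive"
needed_columns = ["Id", "CreationTime", "Operation", "UserId", "Activity"]

for f in mssparkutils.fs.ls(source_dir):
    if not f.name.endswith(".parquet"):
        continue
    df = spark.read.parquet(f.path).select(*needed_columns)
    df.write.format("delta").mode("append").saveAsTable("activity_events")
    # Move the file so the next run does not pick it up again.
    mssparkutils.fs.mv(f.path, archive_dir + "/" + f.name, True)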

Hi @scabral ,

Glad to know that your query got resolved. Please continue using the Fabric Community for any further queries.
