scabral
Helper IV

Suggestions for iterating over files in lakehouse

So this is a two-part question.

 

I've created a pipeline that calls the Power BI REST API Get Activity Events endpoint for one or more days (driven by a parameter) and stores each day's data in a file in a lakehouse, with the date appended to the filename. This part is working.

 

My first question: there are different file formats to choose from in the Copy activity when creating the destination file (JSON, Delimited, Parquet, Binary, Avro, ORC). Right now I chose Delimited, and the pipeline creates a .txt file for each day that I pass in the parameter and stores them in the Files folder in the lakehouse. What I eventually want to do in the pipeline is read the data from these files, most likely just 12-15 columns which will always be the same each time, and insert the rows into a table in the lakehouse. What is the best file format to choose for this process?

 

My second question: what is the best way to iterate over all the files in the lakehouse, select the columns I need, and insert them into the lakehouse table? My guess is that some formats handle the select query better than others. Which activities should I use to loop through each file and move the needed columns and rows into the lakehouse table?

 

thanks

Scott


5 REPLIES
v-gchenna-msft
Community Support

Hi @scabral ,

Thanks for using Fabric Community.
Choosing the most suitable approach depends on your specific needs and the type of data being handled. Here are some suggestions based on my understanding of your scenario:

File Format Selection:

For your scenario, where you'll be reading specific columns from the data and inserting them into a lakehouse table, columnar file formats are the clear choice. Here's a breakdown of the most suitable option:

  • Parquet: This is a widely adopted, efficient format that excels in columnar storage and fast querying. It supports schema evolution, making it adaptable to future data changes. It's a great default choice for most use cases.

Advantages of Columnar Formats:

  • Faster Query Performance: Since you'll only be reading specific columns, columnar formats significantly improve query speed by skipping irrelevant data during processing.
  • Efficient Storage: By storing data in columns instead of rows, they reduce storage footprint and optimize data transfer.

Delimited Text (CSV) Considerations: 

While delimited text (CSV) might seem familiar, it's not ideal due to:

  • Slower Processing: Row-based structure requires reading the entire row even if you only need a few columns.
  • Larger File Size: Less efficient compression compared to columnar formats.
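
To make the column-pruning point concrete, here is a minimal sketch, assuming a Fabric notebook attached to the lakehouse (spark is the session the notebook provides; the file path and column names are hypothetical placeholders):

# Minimal sketch: only the selected columns are read from the Parquet file,
# not the full rows. The path and column names are hypothetical placeholders.
df = spark.read.parquet("Files/activity_events/activityevents_2024-04-01.parquet")
df.printSchema()                                   # the schema travels with the file
df.select("Id", "CreationTime", "UserId").show(5)  # only these columns are scanned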

 

Iteration and Data Processing:

Here's a recommended approach for iterating through files and inserting data into your lakehouse table:

  • A Get Metadata activity can retrieve the list of files from the lakehouse Files folder.
  • A ForEach activity is ideal for iterating over that file list.
  • Inside the loop, a Copy activity (or a Fabric notebook called from the pipeline) can read each file, select the needed columns, and insert the rows into the lakehouse table; a notebook sketch follows this list.
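
If you take the notebook route for the read-and-insert step, here is a minimal sketch, again assuming a Fabric notebook attached to the lakehouse (the folder, table, and column names are hypothetical placeholders):

# Minimal sketch: read all of the daily Parquet files in one folder, keep only
# the needed columns, and append the rows to a lakehouse Delta table.
# Folder, table, and column names are hypothetical placeholders.
files_path = "Files/activity_events"
needed_columns = ["Id", "CreationTime", "Operation", "UserId", "Activity"]

df = (
    spark.read.parquet(files_path)   # reads every Parquet file in the folder
         .select(*needed_columns)    # column pruning: only these are scanned
)

df.write.format("delta").mode("append").saveAsTable("activity_events")

The pipeline equivalent is a Get Metadata activity (field list set to Child items) feeding a ForEach activity, with a Copy activity or Notebook activity inside the loop.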


By following these recommendations, you'll establish a more efficient and scalable pipeline for processing your Power BI activity event data in your lakehouse.

Hope this is helpful. Please let me know if you have further queries.

Hello @scabral ,

We haven't heard back from you on the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it may help others.
Otherwise, we will follow up with more details and try to help.

Hi @scabral ,

We haven't heard back from you on the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it may help others.
Otherwise, we will follow up with more details and try to help.

I created a process in Fabric Data Factory based on some of the points listed above.

 

I created a new data pipeline with the following activities:

1. A ForEach activity that calls the Power BI REST API Get Activity Events endpoint for each day (parameter driven) and stores each day's data in a Parquet file in a new lakehouse.

2. A Get Metadata activity that gets the list of file names from the lakehouse.

3. A ForEach activity that uses a Copy activity to load each Parquet file into a lakehouse table and then archives each file into an archive folder.

 

I still need to make some minor tweaks, but the overall pipeline is working.
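
For anyone who prefers a notebook over the Copy activity for steps 2 and 3, a minimal sketch could look like this (run in a Fabric notebook attached to the lakehouse; the folder, table, column, and archive names are hypothetical placeholders):

# Minimal sketch: list the daily Parquet files, append the needed columns to a
# lakehouse Delta table, then move each processed file into an archive folder.
# Folder, table, column, and archive names are hypothetical placeholders.
from notebookutils import mssparkutils

source_dir = "Files/activity_events"
archive_dir = "Files/activity_events_archive"
needed_columns = ["Id", "CreationTime", "Operation", "UserId", "Activity"]

for f in mssparkutils.fs.ls(source_dir):
    if not f.name.endswith(".parquet"):
        continue
    df = spark.read.parquet(f.path).select(*needed_columns)
    df.write.format("delta").mode("append").saveAsTable("activity_events")
    # Move the file so the next run does not pick it up again.
    mssparkutils.fs.mv(f.path, archive_dir + "/" + f.name, True)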

Hi @scabral ,

Glad to know that your query got resolved. Please continue using the Fabric Community for any further queries.
