Architecture question with external parquet files

ceindev · 03-29-2024

Hello,

I am looking for advice on which Power BI components are best to use given the following:

We have Spark jobs running in our own compute that process data. The Spark jobs write the processed dataframes to Azure Data Lake Storage Gen2, segmented into a separate storage account + container for each project.

For example,

project1 -> spark job writes dataframe1, dataframe2, etc... to storage account 1, container = project1

project2 -> spark job writes dataframe1, dataframe2, etc... to storage account 2, container = project2
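
The per-project layout above can be captured as a small lookup that yields the Data Lake endpoint for each project. This is only a minimal sketch: the account and container names are placeholders, not values from this thread, and the URL shape follows the `https://<storage account>.dfs.core.windows.net/<container>` pattern used in the M query further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectStorage:
    """Where one project's Spark output lives (all values are placeholders)."""
    storage_account: str
    container: str

    def dfs_url(self) -> str:
        # ADLS Gen2 endpoint form passed to AzureStorage.DataLake in Power Query.
        return f"https://{self.storage_account}.dfs.core.windows.net/{self.container}"


# Hypothetical mapping; real account and container names will differ.
PROJECTS = {
    "project1": ProjectStorage("storageaccount1", "project1"),
    "project2": ProjectStorage("storageaccount2", "project2"),
}

for name, storage in PROJECTS.items():
    print(name, "->", storage.dfs_url())
```

Keeping the mapping in one place means the provisioning script only needs the project name to find the right storage location.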

We have a set of Power BI reports that share a semantic model. What I want to do is as follows:

Using the Power BI REST API...

for each project:

provision workspace with name = project name

import our master set of Power BI files into workspace

update semantic model to point to the project's storage account + container's parquet files
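
The three steps above could map onto REST calls roughly as follows. This is a sketch under assumptions, not a confirmed recipe: it assumes the semantic model exposes Power Query parameters for the storage location (so that Update Parameters In Group applies), and `plan_project_calls` is a hypothetical helper that only describes the calls without sending anything.

```python
from urllib.parse import quote

API_ROOT = "https://api.powerbi.com/v1.0/myorg"


def plan_project_calls(project: str, pbix_name: str) -> list[dict]:
    """Return the REST calls for one project, in order, without sending them.

    {groupId} and {datasetId} stay as literal placeholders because they are
    only known from the responses of the earlier calls.
    """
    return [
        # 1. Groups - Create Group: provision the workspace.
        {"method": "POST", "url": f"{API_ROOT}/groups", "json": {"name": project}},
        # 2. Imports - Post Import In Group: upload one master .pbix
        #    (the file itself is sent as multipart/form-data, not JSON).
        {
            "method": "POST",
            "url": f"{API_ROOT}/groups/{{groupId}}/imports"
                   f"?datasetDisplayName={quote(pbix_name)}",
            "json": None,
        },
        # 3. Datasets - Update Parameters In Group: repoint the semantic model.
        {
            "method": "POST",
            "url": f"{API_ROOT}/groups/{{groupId}}/datasets/{{datasetId}}"
                   "/Default.UpdateParameters",
            "json": {"updateDetails": []},  # filled in per project
        },
    ]


for call in plan_project_calls("project1", "Master Report.pbix"):
    print(call["method"], call["url"])
```

Step 2 would be repeated once per file in the master set.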

We have most of this working with the exception of updating the semantic model. Here are my questions:

1) Given that the source data is parquet and that these files can be large, what is the best Power BI technology to use here? I see mentions of using dataflows, but I also see that lakehouses add query performance options.

2) Is there a nice and clean way to script out setting the datasource to the appropriate storage location as I outlined above?

lbendlin · 03-29-2024

1) You already have Parquet. Consume as is. No need to convert into anything else.

ceindev · 03-29-2024

Thank you for this information. Can you expand on this?

- What is the proper place to define the data source definitions? Is it a DataSet, Semantic Model, etc...?

- This is M code I wrote to handle Spark multi-partitioned parquet files (is there a built-in function for this already?). How can I incorporate this into power bi rest api, and which API endpoint do I use (datasource, dataset, dataflow, etc...)?

let
    Source = AzureStorage.DataLake("https://<storage account>.dfs.core.windows.net/<container for project>"),
    #"Filtered rows" = Table.SelectRows(Source, each ([Extension] = ".parquet")),
    Navigation = #"Filtered rows"{[#"Folder Path" = "xxx"]}[Content],
    #"Imported Parquet" = Parquet.Document(Navigation)
in
    #"Imported Parquet"

lbendlin · 03-29-2024

"dataset" is the legacy name for "semantic model". That's the place to define your relationships. Data Source definitions happen before that, in the Power Query phase.

Please elaborate on your "incorporate this into power bi rest api" comment - what are you trying to accomplish?

ceindev · 04-01-2024

Thank you for the additional information. Now I understand that I should update the dataset to point to each project's data, which lives in a different Azure Data Lake Gen2 storage account/container.

So the step I would appreciate more details on is how to do this using the REST API:

update semantic model to point to the project's storage account + container's parquet files

This documentation appears to say that the proper call is Update Parameters in Group

https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/update-datasources-in-group

Are there any examples of how this call should look in order to change a table that uses the M code I previously posted?

let
    Source = AzureStorage.DataLake("https://<storage account>.dfs.core.windows.net/<container for project>"),
    #"Filtered rows" = Table.SelectRows(Source, each ([Extension] = ".parquet")),
    Navigation = #"Filtered rows"{[#"Folder Path" = "xxx"]}[Content],
    #"Imported Parquet" = Parquet.Document(Navigation)
in
    #"Imported Parquet"
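
One possible shape for such a call, as a sketch only: it assumes the hard-coded URL in the query above is first replaced by two text parameters defined in the model (the names `StorageAccount` and `Container` are hypothetical), since Update Parameters In Group can only change parameters that already exist in the dataset. Acquiring the access token is out of scope here, and the IDs below are placeholders.

```python
import json
import urllib.request

API_ROOT = "https://api.powerbi.com/v1.0/myorg"


def build_update_parameters_request(group_id, dataset_id, parameters, access_token):
    """Build (but do not send) an Update Parameters In Group request."""
    body = {
        "updateDetails": [
            {"name": name, "newValue": value} for name, value in parameters.items()
        ]
    }
    return urllib.request.Request(
        url=f"{API_ROOT}/groups/{group_id}/datasets/{dataset_id}/Default.UpdateParameters",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_update_parameters_request(
    group_id="00000000-0000-0000-0000-000000000000",    # placeholder workspace ID
    dataset_id="11111111-1111-1111-1111-111111111111",  # placeholder dataset ID
    parameters={
        "StorageAccount": "<storage account>",      # hypothetical parameter name
        "Container": "<container for project>",     # hypothetical parameter name
    },
    access_token="<token>",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
print(request.get_method(), request.full_url)
```

A dataset refresh would most likely be needed afterwards before the reports show the new project's data.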

Lastly, I just want to confirm that using parquet files (which may be large) will not cause performance issues. I have seen other posts and documentation that suggest using other components in Power BI (like adding compute, a lakehouse, etc...).

lbendlin · 04-01-2024

I recommend you watch this space: Optimizations — Delta Lake Documentation

Z-Ordering seems to be the latest craze.
