Solved: • Data Flow - SharePoint Folder Connector REALLY S...

Anonymous · ‎03-21-2021

Hello community.

I'm building a Data Flow using SharePoint Folder as datasource. I'm connecting to a 1GB FOLDER that holds 60 .txt files (25MB on avg per file). The combine and load step is required, because I need to combine all 60 files in order to have the required data.

Now that I have the connection to the folder stablished, and performed SIMPLE transformations, the evaluation keeps getting cancelled, and I have no Idea why... I'm attaching a print screen from data flow PQ editor, the total time to execute and memory. After that execution, the evaluation gets cancelled.

Here is the code in this entity, which is not loading. Since I connect to SharePoint, I had to filter the folder and files before starting to analyze...that's why there are some filters.

let
  Origem = SharePoint.Files("https://nutaxdigital.sharepoint.com/sites/NutaxAnalytics", [ApiVersion = 15]),
  #"Linhas filtradas" = Table.SelectRows(Origem, each [Extension] = ".txt"),
  #"Linhas filtradas 1" = Table.SelectRows(#"Linhas filtradas", each Text.Contains([Folder Path], "company_to_be_analyzed")),
  #"Arquivos ocultos filtrados" = Table.SelectRows(#"Linhas filtradas 1", each [Attributes]?[Hidden]? <> true),
  #"Invocar a função personalizada" = Table.AddColumn(#"Arquivos ocultos filtrados", "Transformar o arquivo", each #"Transformar o arquivo"([Content])),
  #"Colunas renomeadas" = Table.RenameColumns(#"Invocar a função personalizada", {{"Name", "Nome da Origem"}}),
  #"As outras colunas foram removidas" = Table.SelectColumns(#"Colunas renomeadas", {"Nome da Origem", "Transformar o arquivo"}),
  #"Coluna de tabela expandida" = Table.ExpandTableColumn(#"As outras colunas foram removidas", "Transformar o arquivo", Table.ColumnNames(#"Transformar o arquivo"(#"Arquivo de exemplo"))),
  #"Tipo de coluna alterado" = Table.TransformColumnTypes(#"Coluna de tabela expandida", {{"Nome da Origem", type text}, {"Column1", type text}}),
  #"Linhas filtradas 2" = Table.SelectRows(#"Tipo de coluna alterado", each Text.Contains([Nome da Origem], "txt_files_to_be_analyzed")),
  #"Linhas filtradas 3" = Table.SelectRows(#"Linhas filtradas 2", each Text.StartsWith([Column1], "|0000|")),
  #"Dividir Coluna por Delimitador" = Table.SplitColumn(#"Linhas filtradas 3", "Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.None), {"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7", "Column1.8", "Column1.9", "Column1.10", "Column1.11", "Column1.12", "Column1.13", "Column1.14", "Column1.15", "Column1.16", "Column1.17"}),
  #"Colunas Removidas" = Table.RemoveColumns(#"Dividir Coluna por Delimitador", {"Column1.1"}),
  #"Colunas Renomeadas2" = Table.RenameColumns(#"Colunas Removidas", {{"Column1.2", "REG"}, {"Column1.3", "COD_VER"}, {"Column1.4", "COD_FIN"}, {"Column1.5", "DT_INI"}, {"Column1.6", "DT_FIN"}, {"Column1.7", "NOME"}, {"Column1.8", "CNPJ"}, {"Column1.9", "CPF"}, {"Column1.10", "UF"}, {"Column1.11", "IE"}, {"Column1.12", "COD_MUN"}, {"Column1.13", "IM"}, {"Column1.14", "SUFRAMA"}, {"Column1.15", "IND_PERFIL"}, {"Column1.16", "IND_ATIV"}}),
  #"Colunas Removidas1" = Table.RemoveColumns(#"Colunas Renomeadas2", {"Column1.17"})
in
  #"Colunas Removidas1"

v-deddai1-msft · ‎03-23-2021

Hi @Anonymous ,

There are many causes lead to slow performance. You are using flat files as datasource, so there isn't query folding in your datasource. The mashup engine needs to load all files at the same time then do transitions, it will lead to large memory use . The complex transition will leads to large cpu use.

Are you using dataflow in premium? You can try to enable Enhanced Dataflows Compute Engine and enlarge the Max Memory in workload. Please refer to https://docs.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-premium-workload-configuration

And use computed entity is also a option for you.

Here is a official document for your reference: https://docs.microsoft.com/en-us/power-query/optimize-expanding-table-columns . The expand steps could be the issue of slow performance. when you use join and expand steps, it will increases the number of HTTP calls by one for each row in your datasource.

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Best Regards,

Dedmon Dai

View solution in original post

selimovd · ‎03-21-2021

Hey @Anonymous ,

did the code work in Power Query within Power BI Desktop?

You should be able to copy and paste the code from data flows to Power Query.

If that works, but it's too slow, I would maybe temporary remove all files from the folder except for 2 or 3 and test it then.

Then I would move the rest back and try to let the data flow run.

If you need any help please let me know.

If I answered your question I would be happy if you could mark my post as a solution ✔️ and give it a thumbs up 👍

Best regards

Denis

Blog: WhatTheFact.bi

Element115 · ‎05-20-2022

Source table size < 300K rows, accessing data via an on-prem Gateway, from Power BI Desktop Aprl & now May version.

I have a similar situation but it is even worse as the Mashup Engine threads water forever in the last step of a reference chain of intermediate queries with Enable Load = OFF. This is what it looks like:

query_0 --> query_1 --> query_2 --> query_3 --> final_query

A few days ago, with the same M code, all was working and being processed at an acceptable speed. But after a Save As and a machine reboot (using the April version), suddenly query_2 and query_3 would take the 3+ hours to complete! Same happens after updating Power BI to the May version.

I log in today, see that finally query_0, query_1, query_2, query_3 have all completed but there still is a question mark next to final_query. I click on final_query and it goes on and on forever. And the code of the final query is this simple:

let
    Source = query_3
in
   Source

It is just a reference to the previous query, which has completed. And when looking at the Task Manager, the Mashup Engine is showing Very high power usage using > 11% CPU and 2GB of RAM!! What on Earth is it doing? All the code says is pull the data from the previous query into a table and bloody load it into the model! Why is it stalling when all previous queries have completed successfully?

Anonymous · ‎03-21-2021

Hello my friend. Thanks for the reply. Yes, this code was actually created in PBI Desktop... it was really slow to load all the data ( it used to take 1-2 hours to be able to load in the report), but it worked in PBI Desktop...my idea was to move to data flow to optimize the query performance and be able load the report faster.

I just tried to update the query online with a new folder with only 2-3 files...but it's still very very slow 😞

it does not makes sense....there is like, 80MB in the folder.

v-deddai1-msft · ‎03-23-2021

Hi @Anonymous ,

There are many causes lead to slow performance. You are using flat files as datasource, so there isn't query folding in your datasource. The mashup engine needs to load all files at the same time then do transitions, it will lead to large memory use . The complex transition will leads to large cpu use.

Are you using dataflow in premium? You can try to enable Enhanced Dataflows Compute Engine and enlarge the Max Memory in workload. Please refer to https://docs.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-premium-workload-configuration

And use computed entity is also a option for you.

Here is a official document for your reference: https://docs.microsoft.com/en-us/power-query/optimize-expanding-table-columns . The expand steps could be the issue of slow performance. when you use join and expand steps, it will increases the number of HTTP calls by one for each row in your datasource.

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Best Regards,

Dedmon Dai

Element115 · ‎05-20-2022

Source table size < 300K rows, accessing data via an on-prem Gateway, from Power BI Desktop Aprl & now May version.

I have a similar situation but it is even worse as the Mashup Engine threads water forever in the last step of a reference chain of intermediate queries with Enable Load = OFF. This is what it looks like:

query_0 --> query_1 --> query_2 --> query_3 --> final_query

A few days ago, with the same M code, all was working and being processed at an acceptable speed. But after a Save As and a machine reboot (using the April version), suddenly query_2 and query_3 would take the 3+ hours to complete! Same happens after updating Power BI to the May version.

I log in today, see that finally query_0, query_1, query_2, query_3 have all completed but there still is a question mark next to final_query. I click on final_query and it goes on and on forever. And the code of the final query is this simple:

let
    Source = query_3
in
   Source

It is just a reference to the previous query, which has completed. And when looking at the Task Manager, the Mashup Engine is showing Very high power usage using > 11% CPU and 2GB of RAM!! What on Earth is it doing? All the code says is pull the data from the previous query into a table and bloody load it into the model! Why is it stalling when all previous queries have completed successfully?

Anonymous · ‎03-25-2021

Hello @v-deddai1-msft and @selimovd , thanks for the reply and support. I found a way to optimize my query, the SharePoint that was hosting my folders and files, was too ''busy'', it was hosting a lot of folders and files...It was taking forever to find the folder to use as dasource. So I created a separate Site to use as repository for flat files, now te performance is better...not super fast, but I can manage. I'll read that documentation regarding expanding and merging queries.

Thanks again both of you for the support!!!

• Data Flow - SharePoint Folder Connector REALLY SLOW - Evaluation Cancelled

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024