jaykiu
Frequent Visitor

Staging Dataflows -- Linked tables from dataflow still causing gateway server to run query

Good Afternoon,

 

I've been trying to lessen the load on our SQL server by moving queries to a staging dataflow, but I must not have a clear understanding of how it is supposed to work. I have one dataflow {A} that connects to the database through my on-premises gateway, and multiple dataflows {C, D, E} that reference its linked tables as calculated tables; these act as my transformation dataflows.

 

I import the tables from the transformation dataflows into Power BI Desktop to create my report.

 

My understanding is that after I refresh staging dataflow {A}, it becomes the "source" for the transformations, but what I actually see is the gateway PC maxing out in Resource Monitor whenever I'm editing the transformation dataflows.

 

Do I have a faulty understanding of how it is supposed to work? Is there a setting that needs to be enabled to use a "staging dataflow," or does a dataflow that contains no transformations automatically count as one?

 

Any suggestions on what I can do? I can't query the SQL server every time I'm working on a transformation; it slows down the server and impacts other users.

 


Thanks


9 REPLIES
edhans
Super User

While you are working, the dataflow is folding queries back to the server, and it does this every time you make a change.

It depends on how large the data is, but you can block that folding during development by wrapping one of your steps in Table.Buffer(). So if you have something like Table.AddColumn(#"Some Step Name", "New Field", each [Value] + 1), change it to Table.Buffer(Table.AddColumn(#"Some Step Name", "New Field", each [Value] + 1)).

Now it will hold the table in memory and do everything locally in the gateway or service.
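Here is a minimal M sketch of that wrap, using the same hypothetical step and column names as above (Table.FromRecords stands in for the real SQL or linked-table source):

let
    // Placeholder source - in the real dataflow this step is the SQL query or linked table
    #"Some Step Name" = Table.FromRecords({[Value = 1], [Value = 2]}),

    // Normal version (folds back to the server on every edit when the source is SQL):
    // #"Added New Field" = Table.AddColumn(#"Some Step Name", "New Field", each [Value] + 1),

    // Development version: Table.Buffer holds the result in memory, so later steps
    // stop generating queries against the server while you work
    #"Added New Field" = Table.Buffer(Table.AddColumn(#"Some Step Name", "New Field", each [Value] + 1))
in
    #"Added New Field"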

When you are done and have everything working, remove Table.Buffer. The dataflow will send one query to the server to get the data for as much of your transformations as it can (what will fold) and will do the rest locally.

Note: If your table has millions of records, this won't work; it will just cause your memory to fill up. If that is the case, add a step early on during development to select the top 10,000 records or so - enough to work with without holding a few GB of data in memory. Then do the Table.Buffer trick.

Then remove Table.Buffer and the top 10,000 records limiter step before the final save.



jaykiu
Frequent Visitor

Since my Dataflow A feeds other reports and dataflows, I don't want to change it to a TopN.

 

Will this work?

Dataflow B:

1. Link Dataflow A:[SomeTable]

2. Reference [SomeTable] to become a calculated table:

3. TopN for 10,000 rows

4. Table.Buffer

5. Perform transformations

6. Remove Table.Buffer after the work is completed and the table is ready to use in a report (see the sketch below)
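A rough M sketch of that Dataflow B query (SomeTable here stands for the linked query from Dataflow A, so this only runs inside a dataflow that already contains it):

let
    // Steps 1-2: reference the table linked in from Dataflow A
    Source = SomeTable,

    // Step 3: temporary TopN so development only works against a small slice
    Top10k = Table.FirstN(Source, 10000),

    // Step 4: Table.Buffer keeps that slice in memory while you edit
    Buffered = Table.Buffer(Top10k),

    // Step 5: transformations from here on run against the buffered copy
    Transformed = Table.AddColumn(Buffered, "New Field", each [Value] + 1)

    // Step 6: before publishing, delete the Table.FirstN and Table.Buffer steps
in
    Transformed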

 

If that doesn't work, what is the standard solution for working against a live, active database? Our server absolutely hates me: between table locks from other users and long read cycles from me, I cripple our systems almost immediately (other known issues are at play, of course). What I want is to read each table only once a day, build my transformations, and then set a refresh schedule outside of business hours.

I think that is correct. Your thinking here is 100% on point, but during the development of the dataflow it is a load on the server. 

Another alternative - do you have a test or dev environment? You could develop the dataflow on that, then simply change the database/server names when done.

But even if you don't, once the dev work is finished, the dataflow will be the only impact on your SQL server, refreshing however many times a day it needs to. Everyone else will write reports from the dataflows.

One other thought: it's not totally best practice, but it will really reduce load on the servers. Just copy the tables straight into the dataflow, unaltered except perhaps for date filters so you don't pull in 20 years of data. Then write your transformation dataflows off of those. You'll import more data, but that is less of a load on the servers than continually writing transformations.
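A sketch of what such a "raw copy" staging query could look like (server, database, table, and column names here are made up):

let
    // Staging dataflow: pull the table essentially unaltered
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],

    // The only transformation is a date filter, so you don't pull in 20 years of data.
    // A simple filter like this folds, so the server only returns the recent rows.
    Recent = Table.SelectRows(Sales, each [OrderDate] >= #date(2021, 1, 1))
in
    Recent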

If you were good with M code, you could then take those transformations and apply them to the actual SQL tables. Essentially, you would be creating your own dev environment. But you have to know how to work in the Advanced Editor to move the M code around. On a scale of 1-10, I'd say you need to be a 5-7 in M code to do that effectively.



jaykiu
Frequent Visitor

Thank you for your help, edhans.

 


@edhans wrote:

But even if you don't, once the dev work is finished, the dataflow will be the only impact on your SQL server, refreshing however many times a day it needs to. Everyone else will write reports from the dataflows.


The dev work includes all of the downstream dataflows, correct?

@edhans wrote:

One other thought: it's not totally best practice, but it will really reduce load on the servers. Just copy the tables straight into the dataflow, unaltered except perhaps for date filters so you don't pull in 20 years of data. Then write your transformation dataflows off of those. You'll import more data, but that is less of a load on the servers than continually writing transformations.

I was under the impression that I was importing directly into the dataflow. How do I go about copying the tables to the dataflow? Isn't that what I was already trying to do?


To my knowledge, I am not using DirectQuery.

 


Thanks again,

You are copying (importing) data, but Power Query generates SQL statements and sends them to the server, a process called query folding. It offloads the transformations to the server, and the server sends the results back; that is what you are importing. During the development process, though, it is effectively DirectQuery. Query folding is actually a byproduct of the DirectQuery process.

The benefits are huge. Say you do a big transformation to:

  1. Connect to a table or view
  2. Filter for the last 3 years of data
  3. Remove all but 10 columns
  4. Group by 3 columns and summarize 7 of them
  5. Add 2 more computed columns
  6. Rename the columns
  7. Change the data types

Power Query will analyze all of those and figure out that steps 1-6 can be sent to the server. So it sends a single query, and the server returns a single table representing the result of those six transformations. Then PQ applies the data types (the server cannot do that), step 7 breaks folding, and the data loads. If you do other steps, 8-15 for example, chances are most of those will happen locally in the gateway. Not always, though: PQ might look at steps 8-10, see that they can be moved to after step 6 in the query optimization plan, send those to the server as well, and only then do the folding-breaking step.

But while you are developing, every time you add one of steps 1-6 above, the query goes to the server again. When you add step 3, it sends steps 1-3 as one query; when you add step 4, it redoes the query and sends 1-4; then 1-5, and so on. That is why the development process can be murder on an overworked server.
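As an illustration, a query along those lines might look like the sketch below (the names are invented, and which steps actually fold depends on the connector):

let
    // 1. Connect to a table or view
    Source = Sql.Database("myserver", "mydb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],

    // 2. Filter for recent data (hard-coded cutoff for simplicity)
    Recent = Table.SelectRows(Orders, each [OrderDate] >= #date(2021, 1, 1)),

    // 3. Remove all but a few columns
    Slim = Table.SelectColumns(Recent, {"OrderDate", "Region", "Product", "Amount"}),

    // 4. Group and summarize
    Grouped = Table.Group(Slim, {"Region", "Product"},
        {{"Total Amount", each List.Sum([Amount]), type number}}),

    // 5. Add a computed column
    WithDouble = Table.AddColumn(Grouped, "Double Amount", each [Total Amount] * 2),

    // 6. Rename a column
    Renamed = Table.RenameColumns(WithDouble, {{"Total Amount", "Sales"}}),

    // 7. Change data types - per the explanation above, this is where folding stops,
    //    so steps 1-6 go to SQL as one query and this step runs locally
    Typed = Table.TransformColumnTypes(Renamed, {{"Sales", type number}, {"Double Amount", type number}})
in
    Typed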




@jaykiu - do you now have the answer you need? To be clear, Power Query never holds on to any data. There is a cache for performance reasons, but Power Query is always accessing the source data directly. Import vs. DirectQuery is about what happens once that data is passed to the DAX/VertiPaq engine. Power Query is just a pipe to Extract, Transform, and Load data (ETL).

Can you please mark one or more responses as the solution so this thread can be closed? Thanks.



jaykiu
Frequent Visitor

@edhans So with Table.Buffer(), all I am doing is limiting the dataflow to calling the source once per instance of the table? In other words, it will always be painful to create the flows, even at the 3rd or 4th transformation layer, but once I am back in Power BI Desktop it will not query the source and I will be OK?

 

 

Table.Buffer says, "Gather this data and don't send any more transformations back to the server." It does other things too, but note that it may still re-buffer the data; that re-buffering will be lightweight, though - just a read of 10K records or so, with no complex SQL. There is no way to fully isolate it, unless you were to pull 10K records into Excel, build all of your transformations on that, and then put the M code back onto the SQL connector.

Which is possible.

You can never get PQ to fully stop re-reading the data at some level in the dev process.



lbendlin
Super User

Dataflows are blob storage, similar to CSV or Parquet. They are good for spooling, but not for transform pipelines. Use datasets in Import mode (with incremental refresh if needed).
