arpost
Advocate V

Do pipelines replicate data or just metadata about the source dataset?

I'm interested in using the new Deployment Pipelines feature but want to be sure I'm correct in my understanding of the nature of pipelines. When I deploy via a pipeline from a Dev workspace, I see that my dataflows and datasets now appear as separate artifacts or "objects" in the Test workspace, which concerned me, as some of our datasets may require importing large volumes of data.

However, the linked section of this Microsoft KB entry seems to suggest that pipelines only copy metadata and not the actual data contained in the referenced data source. Is this correct?

 

For example, let's say a Dev workspace had the following:

  1. 2 dataflows connected to a SQL database;
  2. 1 dataset that models the 2 dataflows, which result in 500,000 imported records;
  3. 1 report that uses the dataset above.

If this is then deployed to the Test stage, does this mean that the 500,000 records are only stored ONCE or am I now working with the original 500,000 and duplicated 500,000 records for a total of 1 million across my workspaces?


3 REPLIES
jeffshieldsdev
Solution Sage

Correct, data is not copied. Each Workspace only has the data contained within it--be it dataflows or datasets. The idea is to just work with assets within one environment: dev, test or prod.

 

When promoting to Test, existing data may be retained if there are no structural changes; otherwise you'll have to refresh the dataflows and datasets in Test to populate them.

Thanks for the reply, @jeffshieldsdev. Is there a recommended method, then, of working with a single dataset in a central workspace rather than having each workspace contain data in a dataset? I thought parameterization might be a viable option, but it sounds like the pipeline would actually "undo" that by requiring each workspace to contain its own dataset.

 

Just trying to think about long-term performance and not consuming data storage unnecessarily. 

 

Oh, and to confirm I'm understanding, when you said the following:


@jeffshieldsdev wrote:

Correct, data is not copied. Each Workspace only has the data contained within it--be it dataflows or datasets. The idea is to just work with assets within one environment: dev, test or prod.

 


you were saying "the idea is to ensure all assets are contained within a single workspace" as opposed to saying "the idea is to store data in one workspace and have other stages reference data in that workspace?"

The intent is that you only use the prod workspace for consumption: dev is for development and test is for testing. All downstream dataflows, datasets, and reports should always consume from the prod workspace. Only promote to prod what has been tested.

 

There's no additional cost associated with storing data in all three environments. Depending on your data source there may be other costs associated with extraction, but using pipelines enables you to test without overwriting prod.

 

An option to limit data is to create a parameter in your dataset (like "DevelopmentMode", set to "dev", "test", or "prod"). Have your queries check this parameter: when it's "dev", only import a small number of rows (I use 10); if "test", import a medium-high amount (I use 10,000); and if "prod" or blank, don't impose a filter at all.

 

You can then assign deployment rules in the pipeline settings to set the parameter for each stage's workspace.
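 

For illustration, here is a minimal Power Query (M) sketch of that pattern. The server, database, and table names are placeholders rather than anything from this thread, and DevelopmentMode is assumed to be a text parameter already defined in the dataset:

let
    // Hypothetical source: swap in your own server, database, and table
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // DevelopmentMode is set per stage by a deployment rule in the pipeline
    RowLimit =
        if DevelopmentMode = "dev" then 10
        else if DevelopmentMode = "test" then 10000
        else null,

    // "prod" (or blank) means no limit; otherwise keep only the first RowLimit rows
    Result = if RowLimit = null then Sales else Table.FirstN(Sales, RowLimit)
in
    Result

With a deployment rule setting DevelopmentMode to "test" in the Test workspace and to "prod" in the production workspace, each stage refreshes only as much data as its role needs.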
