arpost
Advocate V

Do pipelines replicate data or just metadata about the source dataset?

I'm interested in using the new Deployment Pipelines feature but want to be sure I'm correct in my understanding of the nature of pipelines. When I deploy via a pipeline from a Dev workspace, I see that my dataflows and datasets now appear as separate artifacts or "objects" in the Test workspace, which concerned me, as some of our datasets may require importing large volumes of data.

However, the linked section of this Microsoft KB entry seems to suggest that pipelines only copy metadata and not the actual data contained in the referenced data source. Is this correct?

 

For example, let's say a Dev workspace had the following:

  1. 2 dataflows connected to a SQL database;
  2. 1 dataset that models the 2 dataflows, which result in 500,000 imported records;
  3. 1 report that uses the dataset above.

If this is then deployed to the Test stage, does this mean that the 500,000 records are only stored ONCE or am I now working with the original 500,000 and duplicated 500,000 records for a total of 1 million across my workspaces?


3 REPLIES
jeffshieldsdev
Solution Sage

Correct, data is not copied. Each Workspace only has the data contained within it--be it dataflows or datasets. The idea is to just work with assets within one environment: dev, test or prod.

 

When promoting to Test, existing data may be retained if there are no structural changes; otherwise you'll have to refresh the dataflows and datasets in Test to populate them.

Thanks for the reply, @jeffshieldsdev. Is there a recommended method, then, of working with a single dataset in a central workspace rather than having each workspace contain data in a dataset? I thought parameterization might be a viable option, but it sounds like the pipeline would actually "undo" that by requiring each workspace to contain its own dataset.

 

Just trying to think about long-term performance and not consuming data storage unnecessarily. 

 

Oh, and to confirm I'm understanding, when you said the following:


@jeffshieldsdev wrote:

Correct, data is not copied. Each Workspace only has the data contained within it--be it dataflows or datasets. The idea is to just work with assets within one environment: dev, test or prod.

 


you were saying "the idea is to ensure all assets are contained within a single workspace" as opposed to saying "the idea is to store data in one workspace and have other stages reference data in that workspace?"

The intent is that you only use the prod workspace for consumption: dev is for development and test is for testing. All downstream dataflows, datasets, and reports should always consume from the prod workspace. Only promote to prod what has been tested.

 

There's no additional cost associated with storing data in all three environments. Depending on your data source there may be other costs associated with extraction, but using pipelines enables you to test without overwriting prod.

 

An option to limit data is to create a parameter in your dataset (like "DevelopmentMode", set to "dev", "test", or "prod"). Have your queries check this parameter: when it's "dev", only import a small number of rows (I use 10); if "test", import a medium-high amount (I use 10,000); and if "prod" or blank, don't impose a filter at all.

 

You can then assign deployment rules in the pipeline settings to set the parameter for each stage's workspace.
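 

For illustration, here is a minimal Power Query (M) sketch of that pattern. The server, database, and table names are placeholders rather than anything from this thread, and DevelopmentMode is assumed to be a text parameter already defined in the dataset:

let
    // Hypothetical source: swap in your own server, database, and table
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // DevelopmentMode is set per stage by a deployment rule in the pipeline
    RowLimit =
        if DevelopmentMode = "dev" then 10
        else if DevelopmentMode = "test" then 10000
        else null,

    // "prod" (or blank) means no limit; otherwise keep only the first RowLimit rows
    Result = if RowLimit = null then Sales else Table.FirstN(Sales, RowLimit)
in
    Result

With a deployment rule setting DevelopmentMode to "test" in the Test workspace and to "prod" in the production workspace, each stage refreshes only as much data as its role needs.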
