Emerick
Frequent Visitor

Best approach to split datasets and reports

Hello there, 

 

I recently started working for a client, and the current top priority is to define the strategy to adopt for distributing datasets, reports and workspaces in the Power BI Service (they are using a Premium capacity).

Basically, this client deals with data from various themes (calls, sales, maintenance interventions, etc.) that feeds different reports.
For the moment, things are not done the right way: every report has its own dataset, even though many reports could share the very same dataset and thus avoid unnecessary redundancy.


One important point to note is that some reports are expected to use data coming from different themes, for example crossing sales data with calls data.

For the moment, I am thinking about creating a single workspace containing all the datasets, with each dataset covering its own theme (one dataset for Sales, another one for Calls, etc.), and then creating one workspace per theme (with only reports in it). These report workspaces would use the shared datasets from the "dataset workspace".

Then, whenever a report needs to cross data coming from different themes, we would use the "DirectQuery for Power BI datasets and Azure Analysis Services" functionality to handle this need. I don't know yet how frequently crossing datasets will be needed, but from what I have heard it should be quite often.

The reason I am proposing to isolate each theme in a specific dataset is to avoid data redundancy.
I believe that the DirectQuery feature would go along pretty well with this idea whenever it's required, but I am not sure about the restrictions we could face using this method, especially (but not only) regarding RLS and paginated reports that would need to use two datasets at the same time.
Also, the reason I am not quite sure about this method is that I don't have much experience with DirectQuery over datasets, so I am not fully aware of the restrictions I would face later, and I feel like using the feature this way is not what it was developed for in the first place.

The other option I am thinking of is to have one dataset per theme, plus extra datasets crossing several themes (e.g. a "Sales & Calls" dataset). This option would maybe make things a bit easier, at the cost of a higher number of datasets and some data redundancy (even though we are planning to use dataflows).


So finally, I have two main questions:

1) Is creating a single workspace for all datasets the right approach?
Or should I use a "one theme = 2 workspaces" logic (i.e. one workspace for the "Sales dataset", one workspace for "Sales reports")?

2) Is relying on the DirectQuery functionality on datasets to cross-reference them a safe bet?
Or are there big limitations that make this approach risky or complicated?
For me, the main point of this "one dataset per theme" approach is to avoid having common tables duplicated across several datasets (but maybe that's not much of a problem, especially when using dataflows).
If this is not the right approach, what would you recommend? Having more datasets with more tables, even if it means table "redundancy" between models?
Or even having one huge dataset with all the tables used by all reports?

Thank you for your feedback, and do not hesitate to tell me if I'm totally wrong in my way of thinking!

11 REPLIES
Anonymous
Not applicable

There's a lot to digest in determining the best method moving forward for you and the client. If I can offer my feedback, it would be to have a proper planning session with content creators (and even stakeholders) to identify the tables you're going to utilise, where the crossover occurs, and where there are opportunities to streamline and consolidate.

 

My personal experience - I've been fortunate to be part of an implementation of Power BI, and it's been a changing and adaptive program of work. It started with a native query implementation, and then import/transformation of specific requirements, which led to multiple datasets being created. Our solution was to create a large dataset but optimise aggressively with Power Query (to the point of doing validation checks in Power Query instead of DAX, and reviewing DAX columns periodically to see if they can be moved to Power Query). This led to a ~86% reduction in PBIX file size.
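As an illustration of that point, here is a minimal Power Query (M) sketch of a validation check done at load time instead of in a DAX calculated column. The table and column names are made up for the example:

```
// Minimal sketch, assuming a simple sales table; in practice the
// source would be a database or dataflow rather than inline rows.
let
    // Inline sample data standing in for the real source table
    Sales = #table(
        {"OrderId", "Amount"},
        {{1, 120.0}, {2, -5.0}, {3, 80.0}}
    ),
    // The validation flag is computed once at refresh time here,
    // instead of being stored as a DAX calculated column in the model
    Validated = Table.AddColumn(
        Sales, "IsValidAmount", each [Amount] >= 0, type logical
    )
in
    Validated
```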

 

My experience leads to a recommendation to identify crossovers and, where possible, build a single dataset to direct-connect PBIX files to, which should give you a single source of truth. Each direct-connected PBIX can service specific needs, but it means everything you have is consolidated in one location. Performance would be impacted by the amount of data and the timeframe you need to bring in, as well as the sheer number of tables... but so far I've been impressed by Power BI's ability to handle things (I'm at about ~40 tables and 3-5 years of data, and it's still performing quite well).

Thanks for the reply!

 

When you say "build a single dataset to direct-connect PBIX files to, which should give you a single source of truth", are you suggesting having only one main dataset for all reports in the end?

If so, I am not against this idea, but I see two main problems:
- Huge data model, not so easy to understand (even though we can create model views to isolate themes)

- When business users need to create reports on their own by connecting to the dataset, they will see a huge number of tables and columns, which could be intimidating when they only need to create a basic report based on one theme (but I might be wrong)

By the way, I totally agree on the process being an adaptive program, and that our first solution might evolve over time. I just want to make the most appropriate choice in the first place, and ideally the most adaptable one!

Anonymous
Not applicable

Yes, I would recommend a larger dataset.

 

From your response, it sounds like there's another layer in the organisational structure that I hadn't accounted for. My understanding now is that you're trying to design a reporting solution for content creators, who will then utilise it to create reports for users to consume.

 

I would still be of the belief that a large dataset is the best way to go. There are a large number of variables, but limiting a scenario due to user intimidation would only be valid if their feedback clearly indicates that (e.g. too complex/difficult to use). The only way to really determine that is feedback sessions with them, to understand what solution would serve them best.

Once again, thanks for the feedback!

 

We had a few meetings regarding this topic, and someone from the team has been arguing that a huge dataset (with all tables, including billions-of-rows tables) could be bad practice, since the simplest report, needing only one subset of the data, would target that big dataset, which would be harmful in terms of performance.

What is your point of view on this claim?

After all the discussions we have already had internally (and that are still going on), I believe the path we will take is many "extended datasets": not just one dataset per theme, but datasets that also include tables from other related themes. Even though this approach implies having the same tables in many datasets, each dataset will be easier for everyone to use, because a lot of information will be accessible without having ALL the data in the same place.

Do you believe this "hybrid" approach is a good one?

Once again, thanks for taking some of your time!

Anonymous
Not applicable

It's definitely a valid concern, especially if you are looking at billions of rows of data. It may be better to explore an alternative solution upstream of Power BI (for instance, the next step I've been informed of for my journey is going to be understanding Azure Data Lake and how it can be utilised).

 

My experience is with sub-100-million-row datasets, and if you don't have the appropriate hardware and network to support it, you're going to hit walls. The hardest part is the initial design and effectively limiting scope creep (for instance, you may have access to billions of rows of data, but what's the point of bringing that in when people are only interested in the past 3-6 months as a date range? Limiting the date range using parameters will vastly reduce the size you're working with; see the sketch below).
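For reference, this is roughly what such a parameter-driven date filter looks like in Power Query (M). RangeStart and RangeEnd are the reserved datetime parameter names Power BI uses for incremental refresh; the server, database, table, and column names here are placeholders:

```
// Sketch of a date-limited import; source and column names are
// placeholders. RangeStart/RangeEnd are datetime parameters defined
// in the PBIX (the reserved names used by incremental refresh).
let
    Source = Sql.Database("sql-server-placeholder", "SalesDb"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Only rows inside the parameter window are imported, keeping the
    // model far smaller than loading "the beginning of time"
    Recent = Table.SelectRows(
        FactSales,
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd
    )
in
    Recent
```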

 

Personal experience again - realistically, I still utilise a mix of datasets with some redundancies, but that is to manage my customers' data needs covering 'the beginning of time', short-term data (max 6 months), and finally real-time data. Each of those datasets has some redundancies, but most notably they have different refresh rates (real-time refreshes every 30 minutes, short-term daily, and 'beginning of time' has incremental refresh established, updating the past 1 month every week to catch retrospective changes).

 

Having the same tables in many datasets is not so much of a problem IF your transformation is the same - you can optimise and cleanse a table through Power Query fantastically, and then just open the query editor and copy/paste the code straight over, so you're working with the same refined table ready for deployment.
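A sketch of what that shared, copy-pasted query might look like - wrapping the cleansing steps in a single M expression makes the pasted code identical in every file (the table, column, and function names here are hypothetical):

```
// Hypothetical reusable cleansing function; pasting the same code
// into each PBIX's Advanced Editor gives every dataset an identical,
// pre-refined table.
let
    CleanCustomers = (raw as table) as table =>
        let
            // Normalise the text key column
            Trimmed = Table.TransformColumns(
                raw, {{"CustomerName", Text.Trim, type text}}
            ),
            // Drop duplicate business keys
            Deduped = Table.Distinct(Trimmed, {"CustomerKey"})
        in
            Deduped
in
    CleanCustomers
```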

Thanks!

I get your point, but for now I don't have much information about why some tables are so big; I will investigate later to make sure that no data is being loaded unnecessarily!

Regarding your last point about having the same table many times, I would go even further by using dataflows, which is what we are going to do! This way, transformations will always be centralized.
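(For reference, a dataset consuming a centralized dataflow entity looks roughly like the M below. This is a sketch: the navigation steps approximate what Power BI Desktop's dataflow connector generates, and the workspace/dataflow/entity names are placeholders.)

```
// Rough sketch of loading a dataflow entity; the exact navigation
// record fields are generated by the connector and may differ.
let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    Workspace = Workspaces{[workspaceName = "Datasets Workspace"]}[Data],
    Dataflow = Workspace{[dataflowName = "Shared Transformations"]}[Data],
    // Every dataset that loads this entity inherits the same
    // centralized cleansing logic, defined once in the dataflow
    Calls = Dataflow{[entity = "Calls", version = ""]}[Data]
in
    Calls
```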

Anyway, would either of you be willing to provide a "summary" answer to my initial question (when to split datasets, and when not to), so that I can validate it as the accepted answer, and maybe help other people in the future?


Thanks to you both!

ACCEPTED SOLUTION
Anonymous
Not applicable

It really comes down to what the business requires. If the data being made available for reporting purposes has crossover points, then I would look to merge it together and filter out what's not useful, so people are only working with information that adds value.

 

In my experience, it comes down to what information needs to be displayed (does it cross themes? if so - merge it together), as well as the rate of refresh, which will determine the impact server-side.

 

As an extra note that I didn't cover before - having a single large dataset and allowing people to 'connect to the dataset' doesn't cause slowness in design/creation. This may be slightly wrong, but it's my understanding that connected PBIX files do not store any of the data/transformations. That's all done in the original PBIX file, and the connected PBIX files are incredibly small (example: I have a 280 MB file as my main dataset, but every connected PBIX file is less than 500 KB). This also helps maintain a single source of truth, which is a common business issue.

Shahfaisal
Solution Sage

From what you are describing, I strongly feel the need for a Semantic Layer/Model. See the link below for details.

https://docs.microsoft.com/en-us/power-bi/guidance/center-of-excellence-business-intelligence-soluti...

 

I am not a fan of "relying on the DirectQuery functionality on datasets to cross-reference them". If you need to link two different datasets for your reporting requirements, then that is a strong indication that those datasets should be one dataset.

 

Thanks for the reply!

In my understanding, Power BI datasets represent a semantic layer, so this is already what we are planning to do, except we're not sure whether to create:

- One dataset per theme (and cross-reference them with DirectQuery)

- One dataset per theme, plus other datasets that combine them (e.g. Sales + Calls)

- One big main dataset with all the tables in it

 

From your last sentence, you seem to be suggesting having only one big dataset, or at least datasets that combine data from multiple themes, am I right?


@Emerick wrote:

From your last sentence, you seem to be suggesting having only one big dataset, or at least datasets that combine data from multiple themes, am I right?


Correct, if you are finding the need to query/report/analyze across multiple datasets. The downside of one large/complex model is that it is a bit difficult to use and to navigate the various objects. If you have worked with Analysis Services, you know that Perspectives were very helpful in breaking down a complex model into small *virtual* models without physically creating separate models.

Though perspectives are not supported out of the box in Power BI Desktop, there is a workaround. See this post: https://data-marc.com/2020/08/18/power-bi-visual-customization-using-perspectives/

 

Yes, I know about perspectives; such a shame we cannot have them "easily" in Power BI, but hopefully someday... 🙂

Anyway, thanks for the feedback.
Someone from the team is arguing that a very big dataset with all the data (some tables having billions of rows) could be a very bad idea performance-wise, especially when "little" reports need to query that huge dataset.
I have to admit that I don't see why it would be bad in terms of performance, since a report only queries a subset of the data model... or do you think that person is right?

To me, the huge drawback of big datasets (without perspectives) is the navigation experience.
