Maggi029
Helper II

How does Power BI handle large datasets?

I'm using Elasticsearch as my database, and I need to build historical data visualisations. Every week I push about 30 MB of JSON data into the Elasticsearch index, and I import the data from Elasticsearch into Power BI using Python scripting; everything was working fine. After a few weeks the data in Elasticsearch has grown to around 3 GB, and Power BI Desktop now throws the following exception when trying to refresh the data.

(screenshot of the error attached: Maggi029_0-1634616979193.png)

 

What is the best way to handle this scenario? Will I need to change the input method, or is there a better way to handle this?

 

Code:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

# Connect to the local Elasticsearch cluster
client = Elasticsearch([{'host': 'localhost'}])
my_datasets1 = []

# Loop over every index that matches the wildcard pattern
for index in client.indices.get('mydata_*'):
    if 'myindexname_' in index:
        # scan() streams every document of the index via the scroll API
        s = Search(using=client, index=index)
        my_datasets1.append(pd.DataFrame([hit.to_dict() for hit in s.scan()]))

# Combine the per-index frames into the single dataset Power BI imports
metrics_dataset = pd.concat(my_datasets1)

 

 

@nicpenning, can you please look into this?

6 Replies
nicpenning
Helper I

Sorry Maggi029, I am not sure how Python ties into your Power BI integration.

I query Elasticsearch directly with Power BI and use pagination to get all results on GBs of data.
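Purely for illustration, here is a rough sketch of that pagination idea written against the Python client this thread already uses, based on search_after (the index pattern, page size, and sort field are assumptions, not my actual Power Query setup):

from elasticsearch import Elasticsearch

client = Elasticsearch([{'host': 'localhost'}])

page_size = 1000
search_after = None   # cursor returned by the previous page
all_hits = []

while True:
    body = {
        'size': page_size,
        # a stable sort is required; in practice add a unique tiebreaker field
        'sort': [{'@timestamp': 'asc'}],
    }
    if search_after is not None:
        body['search_after'] = search_after
    resp = client.search(index='myindexname_*', body=body)
    page = resp['hits']['hits']
    if not page:
        break
    all_hits.extend(page)
    search_after = page[-1]['sort']  # sort values of the last hit become the next cursor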

 

However, I recommend looking into Elasticsearch Data Transforms, which allow you to aggregate large sets of data into a new index containing just the values you need, instead of working with raw documents.

https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-overview.html

For example, say you have 10,000 documents such as:

{"computer_name": "computer1", "organization": "Finance", "country": "United States of America"}

{"computer_name": "computer2", "organization": "Finance", "country": "United States of America"}

...
{"computer_name": "computer10000", "organization": "Finance", "country": "United States of America"}

You can use transforms to get counts, averages, etc., and reduce the data down to just 3 documents:

{"unique_computer_names": "700", "organization": "Finance", "country": "United States of America"}
{"unique_computer_names": "100", "organization": "Human Resources", "country": "Nigeria"}
{"unique_computer_names": "200", "organization": "Information Technology", "country": "France"}

At scale, an approach like this can take you from billions of documents down to thousands once you narrow in on the specific data you want to surface as high-level metrics in Power BI.

This might be a lot of work and research, but you can dramatically shrink large datasets by creating summaries of the data and displaying only what is needed. It's not a quick solution by any means, but it helps Power BI by working with summarized data.
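As an illustration, here is a minimal sketch of such a transform created through the Python client (assuming Elasticsearch 7.5+ and its transform API; the transform id, index names, and field names below are hypothetical):

from elasticsearch import Elasticsearch

client = Elasticsearch([{'host': 'localhost'}])

transform_body = {
    'source': {'index': 'computers-*'},
    'dest': {'index': 'computers-summary'},
    'pivot': {
        'group_by': {
            'organization': {'terms': {'field': 'organization.keyword'}},
            'country': {'terms': {'field': 'country.keyword'}},
        },
        'aggregations': {
            # approximate distinct count of computer names per group
            'unique_computer_names': {'cardinality': {'field': 'computer_name.keyword'}},
        },
    },
}

client.transform.put_transform(transform_id='computer-summary', body=transform_body)
client.transform.start_transform(transform_id='computer-summary')

Power BI would then read the small computers-summary index instead of the raw documents.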

Hi @nicpenning 

 

Thanks for looking into this.

 

Previously, as you suggested (Solved: Re: Getting Data from Elastic Stack with REST with... - Page 2 - Microsoft Power BI Communit...), I used the REST API to import the data from Elasticsearch. However, I was unable to refresh the data in the Power BI online service using this method, because the recursive calls turn the data source into a dynamic data source, and Power BI fails to refresh dynamic data sources. As a result, I rely on the Python method. Do you believe my requirement can still be met using the REST API, and if so, could you please elaborate?

I am not familiar with dynamic data sources or the online Power BI service, so I won't be much help here. I hope you can find a solution soon!

TomMartens
Super User

Hey @Maggi029 ,

 

from my perspective, you are facing (or will be facing) some challenges. These are:

  • Power BI file size limitations for upload: these limits are ~2 GB for Pro licensing and ~10 GB for Premium per-capacity licensing
  • Using Python inside Power BI for data wrangling/data shaping

This is how I would tackle it:

  • Separate the content from the data by using two different pbix files: one that only publishes the data model to the service, and a second one that publishes the content. The content pbix uses the Power BI dataset as its data source. One advantage of this approach is that you gain more flexibility if architectural changes happen.
  • I would not use Python for data shaping/data wrangling inside Power BI, for a simple reason: using Python (or R) inside Power Query requires the on-premises gateway in personal mode if you are planning scheduled data refresh. Some organizations (e.g. the one I'm working for) do not allow gateways in personal mode, and requesting one can lead to discussions with your Power BI service administrators.

Some architectural considerations

To overcome the upload file size limit, I would consider moving the Python code to Azure, e.g. creating an Azure Automation runbook or writing an Azure Function. This code then creates JSON documents, each containing the data for a day or a month, whatever you consider useful. The JSON documents can be stored in an Azure blob container and picked up from there by Power Query. You can create your data model using just a small number of JSON documents, publish the pbix file to the service, copy all the JSON documents to the blob container, manually refresh the dataset, done!
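A minimal sketch of that export step, assuming the azure-storage-blob package and hypothetical connection details (storage connection string, container name, index pattern, and @timestamp field):

import json
from datetime import date

from azure.storage.blob import BlobServiceClient
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# Hypothetical connection details -- replace with your own
client = Elasticsearch([{'host': 'localhost'}])
blob_service = BlobServiceClient.from_connection_string('<storage-account-connection-string>')
container = blob_service.get_container_client('powerbi-extracts')

# Pull one day's documents and write them to the container as a single JSON file
day = date.today().isoformat()
s = Search(using=client, index='myindexname_*').filter(
    'range', **{'@timestamp': {'gte': f'{day}||/d', 'lt': f'{day}||+1d/d'}}
)
docs = [hit.to_dict() for hit in s.scan()]

container.upload_blob(
    name=f'metrics_{day}.json',
    data=json.dumps(docs, default=str),
    overwrite=True,
)

Power Query then only has to combine the JSON files in the container, and the heavy extraction never runs inside the pbix file.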

 

Hopefully, this provides some ideas on how to overcome your challenges.

 

Regards,

Tom




Hi @TomMartens 

 

Thanks for your reply. I think your suggestion has some useful benefits, but I have a few doubts that need clarifying, since I'm new to this method.

 

1. Separate the content from the data by using two different pbix files - what is the use case for this method, and does it also help with size management?

2. Azure Automation script - as I understand it, you're suggesting writing an Azure Automation script that pulls data from Elasticsearch, converts it into small JSON documents, and saves them in blob storage, which I can then connect to Power BI via the Azure Blob Storage connector. Correct me if I'm wrong.

 

Thanks

 

 

Hey @Maggi029 ,

 

Regarding 1:
Besides being able to develop the data model and the content in parallel, it's important to know that you can't download the pbix file once you have enabled incremental refresh.
Incremental refresh does not directly address large file sizes, but it reduces the duration of a data refresh, as only the increment is added to the model. This becomes a little more complex if the source is files, but there are many posts out there describing the proper configuration. The one by @GilbertQ is my favorite: How you can incrementally refresh any Power BI data source (This example is a CSV File) - Reporting/...

 

Regarding 2:

Yes, accessing files from a blob store is simpler than publishing a single large pbix file to the Power BI service.

 

Regards,

Tom



