Hi all, apologies in advance, this is going to be long...just because there are a lot of questions and potentially I'm not doing something quite right.
I'm hoping that dataflows with the enhanced compute engine can solve a general class of problem we have - which is joining data from different databases. Power Query is very slow at this because the query can't be pushed down - since it crosses multiple databases. My hope was that the new "enhanced compute engine", which I believe stores entities not only as CSV files but also in a backend database, would help make these types of operations fast and easy. Microsoft promoted this with "Now you will be able to efficiently combine and transform different data sources and manage TB’s of data using dataflows. With the new engine, calculation of computed entities can now be up to x20 faster when working with large scale data volumes." Dataflows June 2019 Feature Summary
Having seen this, I started testing on a very basic scenario - load billing data from one database, and patient data (which can join to the billing data) from a different database. Specifically, I did the following steps:
In my mind, the two entities are NOT very big. The Billing entity has about 5.6 million rows, and the patient data is 3.7 million rows. I'm not sure how to estimate how much space these take up in the enhanced compute engine, other than noting that if I load the entities directly in to Power BI desktop, I see that billing shows as 1.26 GB and patient as 345 MB.
In other words, not that many rows, and way less than the "Terabytes" size in the announcement. In general - this is not working, and is leading to a ton of questions. Can anyone help answer these?
I apologize if these are answered somewhere - I searched and couldn't really find details. I was really hoping this would work - because we have multiple compelling reasons why being able to join the results of queries against multiple databases would be SUPER helpful. But so far not really having any luck.
Data is loaded to the compute engine's sql db as part of the ingest flow. All data loaded to entities in a workspace backed by a capacity with the engine enabled are loaded to the sql db
You can only control this by which workspaces are backed by capacities with the compute engine enabled. The allocation setting in the capacity admin page is about memory allocation. This is memory that is set aside for the engine and not able to be used by the tabular models on the capacity. because the sql db's are not in memory objects, the size of the data loaded to dataflows is not related to the memory allocation. instead the DB size consumes the 1TB of disk storage associated with each capacity.
I don't know. Its hard to make definite or even likely statements about performance without more detail.
See first answer. No there are no easy metrics here. That would be nice.
see first answer. since the DB is not an in memory object, this only happens as part of query execution. both power query and, i assume, the compute engine sql server have memory paging systems.
Could we get somebody from MS to answer those questions?
These are real life questions not covered by any docs and info about dataflows and real life performance is scarse. My test, even though less conslusive lead to completely giving up on this service as it was too slow to use. Would love to look into it again once there is more visibility what works and how so that we can set expectations properly.
Hi @Anonymous ,
I would recommend that you contact the author of dataflow under the document, they can answer your questions more comprehensively and professionally.
Two more questions I've come up with:
1. Is there anyway to see the full M code for an entity? In Power BI desktop I can click on the "Advanced Editor" button to see the full code for a table. In the service, it looks like the only way to see all code is to go through "step by step" and copy the code one step at a time?
2. How would I migrate a dataflow from a "development" workspace to a production workspace?
I would really love to see many of these questions answered @Anonymous as I'm experiencing similar headaches to you, very frustrating are all the lengthy waiting periods working with PowerQuery in the service for such things like preview results (inparticular with merge steps), endless issues eith refresh timeouts, and even to simply do the validation required to save & exit out of the PowerQuery 'edit' screen can be herendous for me depending on the complexity of what i'm doing.. but even something as simple as changing a few data types can have me waiting literally an hour+ for the query to validate and save.. hair-pulling stuff I'm sure you'll agree). On average I've experienced about 80% if time watching spinning wheels, and 20% actually getting anywhere with progressing a relatively simple dataflow model (with entitites of around the same size as yours 6M - 12M rows (we're on P1 Capacity however)) towards completion.
Besides all that.. all I have to offer you currently is that yes - you can actually view your M query inside the PowerQuery editor in the service - just right-click on your table in the 'Queries' pane and choose 'Advanced Editor'.
Best of luck getting some helpful 'real-world experience' responses to these questions!
Completely agree. I have exactly the same issues.
Such a great feature (dataflows), but when you try to build more logic it gets painfully slow.
this has to change!
Power BI release plans for 2023 release wave 1 describes all new features releasing from April 2023 through September 2023.
Make sure you register today for the Power BI Summit 2023. Don't miss all of the great sessions and speakers!
We had a great 2022 with a ton of feature releases to help you drive a data culture.