Re: ODBC and Large Data Sets

jsperson · ‎02-20-2018

Hey folks,

I'm having an issue with Power Query (both Power BI and Excel) and am casting a wide net looking for options.

Here's my requirement - the data source has about 300M records. I have the option of using three different database engines - AWS Athena, AWS Redshift, or Presto. The problem that I'm having is with Power Query SQL generation using the ODBC driver. Let's say I select two attribute columns and two measure columns. If I group by both attribute columns, Power Query pushes down the group by SQL to the database and returns only a few records with teh measures aggregated appropriately. In my analysis I noticed that one of the measure columns had a bunch of null values that I want to exclude. So I placed a filter on that column. When I hit go Power Query proceeds to try to bring back the entire dataset (filtered, but that only removes 10% of the records) before doing the group by. This seems like a signficant deficiencey in the Power Query engine.

I've tested with Tableau and the same driver. Tableau creates a SQL statmenet like SELECT foo, SUM(foofact) FROM bar WHERE foofact IS NOT NULL GROUP BY foo. That's exactly what I would expect from Power BI.

Any suggestions appreciated...

Greg_Deckler · ‎02-20-2018

In the ODBC connection, expand "Advanced Options" and then slap in your SQL.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

jsperson · ‎02-20-2018

Thanks - I'm a huge Power BI fan so I'm not being flip here and I really do appreciate your response. Your answer just isn't acceptable for the typical Power BI user base. If the user knew how to write SQL/M then they really wouldn't need Power Query. We're talking about a pretty basic piece of SQL writing for Power Query (hard for an end user, but should be easy for a query tool).

Here's the query I'm looking for:

SELECT foo1, foo2, SUM(foo3) foo3, SUM(foo4) foo4 FROM bar WHERE foo3 IS NOT NULL GROUP BY foo1, foo2 LIMIT 1000

This is almost exactly what Tableau produces by default - no tweaks, not special connection string etc. One of the primary purposes of my effort here is to justify Power BI over Tableau. I love the integration of Power Query in both PBI and Excel. I just wish it wrote better queries...

Greg_Deckler · ‎02-20-2018

I think that there is another perspective here around justifying Power BI versus Tableau. Power BI comes with Power Query, which is an incredibly powerful data mashup/ingest/ETL tool. To get anything even remotely equivalent in Tableau you have to purchase something like Alteryx, which every shop that owns Tableau buys (at least the people that I know that own Tableau).

So, the focus with Power Query is not to write SQL Queries for people but rather allow them to do much more powerful data manipulation. In order to do that, it needs a starting point in terms of the data. The mechanism for that working is not geared toward pre-filtering the data but rather working with the data as defined. It allows for the pre-filtering of data via SQL if you really want but is perfectly capable of working without it on large rowsets.

I understand what you are saying but I think it doesn't take into consideration the entire picture. When you say you wish Power BI wrote better queries, it can write queries that are incredibly more powerful and better than Tableau. Tableau out-of-the-box is really incredibly limited in it's query capabilities. It is like working with tinker toys compared to Power Query.

@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

jsperson · ‎02-20-2018

>>It is like working with tinker toys compared to Power Query.

It's funny you say that. As I mentioned I'm a big proponent of Power BI. Unfortunately for me, the tinker toy tool (Tableau) is actaully returning data. Because of the way Power BI writes queries, it's just blowing up. I don't need to look at the entire picture. One tool (Power BI) just flat doesn't work for large data sets via ODBC without hand feeding it data via custom SQL. Even this is pretty hard because if I use a limit clause, Power BI then proceeds to wrap this and aggregate only on my limited number of records. I can then fully develop my report, but I have to remove the limit clause at some point at which time Power BI proceeds to try to download 300M records again thereby choking. Final victory - "tinker toy" Tableau...

Thanks again for your time. I was already pretty sure that there was not a good solution I was just looking for confirmation.

Edit: grammar.

jsperson · ‎02-20-2018

Here is a bit of background as to what happens when I actually try to use Power BI. I also have a discussion going on the Power Query board. They don't seem to have an answer either. The example below explains what happens when I try to filter and group by in sequence. I've included the SQL generated as well as the M code. This is mainly for future reference. I don't expect anyone to actually diagnose this. At some point though I'd love to talk to the query engine development team...

It is sending the following query when coming into the Power Query editor in preview mode (10 min wait):

SELECT foo1, foo2,...foo74 FROM bar

Note: NO LIMIT CLAUSE! Tableau uses a limit clause and returns in 5 seconds.

Then I ask for only my two attributes and two measures (2-3 minute wait):

SELECT foo1, foo2, foo3, foo4 FROM bar

Again: no limit clause

Then I filter where one measure is not null (another 2-3 minute wait):

SELECT foo1, foo2, foo3, foo4 FROM bar WHERE foo3 IS NOT NULL

Then I do a group by both attributes with SUM on both measures. I can't provide this query from Power BI as it blows up my client before finishing. It's trying to return all 300M records to my client so it's pretty obvious that it did about the same thing as immediately above except without whatever limits it normally uses.

Tableau sends the following, which returns in about 5 seconds:

SELECT foo1, foo2, SUM(foo3) foo3, SUM(foo4) foo4 FROM bar WHERE FOO3 IS NOT NULL GROUP BY foo1, foo2 LIMIT 1000

Here is the M code from Power BI:

let
Source = Odbc.DataSource("dsn=datalake-poc-athena-64", [HierarchicalNavigation=true]),
AwsDataCatalog_Database = Source{[Name="AwsDataCatalog",Kind="Database"]}[Data],
hmda_Schema = AwsDataCatalog_Database{[Name="hmda",Kind="Schema"]}[Data],
hmda_lar_Table = hmda_Schema{[Name="hmda_lar",Kind="Table"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(hmda_lar_Table,{"action_taken_name", "agency_name", "applicant_income_000s", "loan_amount_000s"}),
#"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([applicant_income_000s] <> null)),
#"Grouped Rows" = Table.Group(#"Filtered Rows", {"action_taken_name", "agency_name"}, {{"Income", each List.Sum([applicant_income_000s]), type number}, {"Loan", each List.Sum([loan_amount_000s]), type number}})
in
#"Grouped Rows"

jsperson · ‎03-16-2018

We opened a support case and pursued it as far as possible. At this point there is no solution.

It seems that when working with ODBC sources Power BI just doesn't do much in the way of writing efficient SQL. The behavior when filtering then grouping is to filter at the source, but group on the client. Also the previews do not send limit clauses. Effectively this all means that Power BI can't work with datasets beyond a few thousand records with ODBC.

The development team did accept this as a potential improvement so there is hope that someday this capability will be added. Until then unfortunately I'll have to use a competing product to Power BI.

SamTrexler · ‎03-14-2019

I found this old post while trying to move my reports to use Athena/Parquet for a new client. I,too, and having performance issues with large datasets - and utilizing large datasetds is part of the point of going to Athena.

I've done some tracing with the ODBC administrator, and it seems the ODBC drivers retrieve one column at a time for each row - and inherently slow mechanism.

So if Power Query can't effectively fold the M statements into the query, as it does for SQL Server, then the combination of expensive queries with big rowsets and the inherently slow transmission via ODBC, it seems Power BI is not well suited to retrieving information from this big-data storage mechanism.

Has anyone found a solution to this?

jsperson · ‎03-14-2019

Thanks for the comment - you are right. Power BI tries to bring over the whole dataset to the client.

I'd be interested in revisiting this as well. Iam hopeful that Power BI will get a fix. I was a big proponent of Power BI until I ran across this problem and ended up dropping it almost entirely out of our practice. Tableau is working great because it simply writes queries that are more appropriate for large datasets.

yan · ‎03-21-2019

If you are on a recent release - not sure when it was added -, you can specify the clause to limit rows on previews in Query editor.

See below for an external BigQuery ODBC driver (from Google itself). This should be available for any driver.

I did test by looking at the SQL's BigQuery get:

if you don't set this, PBI won't know what synthax to use and hence would load the whole set
once set, PBI appended the SQL with LIMIT 1000

Should work for you too if you don't make any aggregation.

If you aggregate either in Editor or in charts, if PBI deems it's not possible to use query folding, it will load all and aggregate locally. The conditions for folding are very opaque (=undocumented).

If you are in luck, there may be a datatype with which PBI has no issue that is very similar to the "datatype-in-error".

For BigQuery, NUMERIC is problematic but FLOAT64 isn't so I can use a view that cast to float and folding works.

EDIT: I think the solution may be found somewhere here ! https://docs.microsoft.com/en-us/power-query/odbc

by trial-n-error

yan · ‎03-22-2019

Try adding SoftNumbers=true in the DataSource option i.e.

Odbc.DataSource("dsn=xxx", [HierarchicalNavigation=true,SoftNumbers=true])

It did the trick for me, PBI sent a proper sum() SQL

ODBC and Large Data Sets

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024

How to Get Your Question Answered Quickly