cancel
Showing results for 
Search instead for 
Did you mean: 

Improve ODBC and Large Data Sets

The following two suggestions are features of Tableau that make it a better solution than Power BI  for working with larger data sets.

 

1. Power Query doesn't seem to be able to fold a filter and group by into the SQL. A simple query like this should be possible:

SELECT foo1, foo2, SUM(foofact1) foofact1, SUM(foofact2) foofact2

FROM bar

WHERE foofact1 IS NOT NULL

GROUP BY foo1, foo2

 

Instead Power BI is sending:

SELECT foo1, foo2, foofact1, foofact2

FROM bar

WHERE foofact1 IS NO NULL

 

Then I suspect that it would try the GROUP BY on the desktop. Unfortunately I never get that far because the desktop can't handle 300M records... I'd never expect it to handle that large of a data set. That's why the query should filter and group on the database.

 

2. Power Query should allow a LIMIT clause to be defined such that the preview and editor don't try to bring back more than X number of records. Preview definitly does not work when working with large data sets. Well, it does, but in my case it's taking 10 min for the data to come through. I believe Power Query is assuming that all sources can stream results. This is not the case. A LIMIT clause would help the database run lighter.

Status: Needs Info
Comments
Super Contributor

Hi @jsperson,

 

1.1 Did you try it out? I tested a similar SQL statement with 3.4M records successfully.

select storekey, salesquantity, sum(storekey) sk, sum(SalesQuantity) sq  from FactSales where StoreKey is not null group by StoreKey, SalesQuantity;

1.2 The maxmium of rows is 2 billion. I think 300M records won't be a problem. Please refer to service-get-data#considerations-and-limitations.

 

2. There is a Limit clause (TOP in SQL Server) when Power Query sends statements to the Server. 

2.1 What's your data source?

2.2 What's your expected LIMIT clause?

Improve_ODBC_and_Large_Data_Sets

 

 

Best Regards,

Dale

Established Member
Status changed to: Needs Info
 
Frequent Visitor

Dale, thank you very much for your reply. I'll provide more information that I hope will be helpful. This is an excerpt from my discussion on the Power Query Technet site.

 

Database: AWS Athena and AWS EMR (Hadoop) using Presto DB. I tried the Redshift drivers, but the Power BI driver has a bug where external tables won't display.

Client: Windows EC2 instance on AWS with 16GB RAM. Actual instance type is t2xlarge

Power BI: 2.55.5010.641 64-bit (February 2018)

Tableau: 10.5.1

 

Because I am having such a hard time with Power BI here I decided to attack the same requirement with Tableau. Every comment I make below about a needed feature or performance issue is addressed appropriately by Tableau. Note that Power BI is my tool of choice. I have about 2 hours of time on Tableau and hundreds on Power BI.

 

I'm pretty sure that I know where the "folding" isn't occurring - it can't seem to write a SQL statement with both a where and group by clauses. Here are my steps and the resulting SQL where available.

1. It is sending the following query when coming into the Power Query editor in preview mode (10 min wait):

SELECT foo1, foo2,...foo74 FROM bar

Note: NO LIMIT CLAUSE! Tableau uses a limit clause and returns in 5 seconds.

2. Then I ask for only my two attributes and two measures (2-3 minute wait):

SELECT foo1, foo2, foo3, foo4 FROM bar

Again: no limit clause

3. Then I filter where one measure is not null (another 2-3 minute wait):

SELECT foo1, foo2, foo3, foo4 FROM bar WHERE foo3 IS NOT NULL

4. Then I do a group by both attributes with SUM on both measures. I can't provide this query from Power BI as it blows up my client before finishing. It's trying to return all 300M records to my client so it's pretty obvious that it did about the same thing as immediately above except without whatever limits it normally uses.

Tableau sends the following, which returns in about 5 seconds:

SELECT foo1, foo2, SUM(foo3) foo3, SUM(foo4) foo4 FROM bar WHERE FOO3 IS NOT NULL GROUP BY foo1, foo2 LIMIT 1000

 

Here is the M code from Power BI:

 

let
    Source = Odbc.DataSource("dsn=datalake-poc-athena-64", [HierarchicalNavigation=true]),
    AwsDataCatalog_Database = Source{[Name="AwsDataCatalog",Kind="Database"]}[Data],
    hmda_Schema = AwsDataCatalog_Database{[Name="hmda",Kind="Schema"]}[Data],
    hmda_lar_Table = hmda_Schema{[Name="hmda_lar",Kind="Table"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(hmda_lar_Table,{"action_taken_name", "agency_name", "applicant_income_000s", "loan_amount_000s"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Other Columns", each ([applicant_income_000s] <> null)),
    #"Grouped Rows" = Table.Group(#"Filtered Rows", {"action_taken_name", "agency_name"}, {{"Income", each List.Sum([applicant_income_000s]), type number}, {"Loan", each List.Sum([loan_amount_000s]), type number}})
in
    #"Grouped Rows"

 

Edit: Changed salutation; realized that Dale responded, not Vicky - Sorry

Frequent Visitor

We created a support case and pursued it as far as possible. At this time there is no solution.

 

It seems that when working with ODBC sources Power BI just doesn't do much in the way of writing efficient SQL. The behavior when filtering then grouping is to filter at the source, but group on the client. Also the previews do not send limit clauses. Effectively this all means that Power BI can't work with datasets beyond a few thousand records with ODBC. 

 

The development team did accept this as a potential improvement so there is hope that someday this capability will be added. Until then unfortunately I'll have to use a competing product to Power BI.