Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
t-zhguan
Frequent Visitor

DistinctCount Approximation or Exact

I'm connected to a Kusto cluster using DirectQuery and noticed that when I used DistinctCount to create a measure and visualize said measure, the results are sometimes off by +/- 1-2%. When I use visual filters to manually look at top/bottom n values, the actual number of distinct entries match the query results I get from Kusto.

 

Just wondering if the results from the distinctcount function is exact or approximate and if it's normal to expect some slight margin of error when using aggregate function on larger data sets through directquery.

11 REPLIES 11
Greg_Deckler
Super User
Super User

Do you get the little yellow ! in your visual that indicates that data is being sampled?


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

Nope, I do not see the yellow exclamation mark.

 

I have 3 tables in a page, one that shows the measure, which is just DISTINCTCOUNT(column), and two other columns to slice the data. The other two have the unsummarized data. If I create visual filters on the two unsummarized one and sort them in reverse order to show the top 1 and bottom N, it does not match the value that DISTINCTCOUNT returns, but does match my Kusto query.

 

For example, the distinctcount function returns 1832, but if I create visual filters showing the top 1 and bottom 1832, the values do not match. However, if I create visual filters showing the top 1 asc and bottom 1853 desc (the Kusto query result), then the values match.

 

 

Hmm, really tough to troubleshoot that one without data. But, if it is a large data set or has sensitive data, that can be difficult. Approximately how many rows are in your data set?


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

I've attached a screenshot visualizing the issue. The first row shows 1832 distinct values, but my Kusto query returns 1852. As well, If I do a top 1 desc and bottom 1832 desc, the values do not match, but I instead do a bottom 1852 desc, the values match.

ScreenCapture.png

 

Total dataset has ~1.3M rows. Count of the first row of data returns a bit under 9000 rows.

OK, looks like it text. The documentation on DISTINCTCOUNT isn't very detailed. I wonder if it is possible that DISTINCTCOUNT is not counting on something distinct that your other process is. like "ff" and "FF" might be the same or possibly trailing characters, etc. Have you done a clean and trim on that when you import it? Or is this live? I can't remember.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...

Yea, I originally thought that too, but sometimes PowerBI shows more and other times it shows less. The data is cleaned and trimmed and I'm using DirectQuery instead of importing since the data source is updated on a regular basis and expected to grow quite large over time.

Does not really address the issue why DISTINCTCOUNT is not functioning the way you want but you may want to look at using COUNTROWS(VALUES(table[field])) as an alternate way to get the nubmer of unique values.  I believe this is also more efficient in terms of performance espeically with large data sets. 

 

If this returns the same as distinctcount then the issue is probably with filter context somewhere that is filtering out some of the data points.  You also may want to check your query to ensrue there are no errors in any calculated columns dropping records.

I tried COUNTROWS(VALUES(Column)) and it gives the exact same number as DISTINCTCOUNT(Column).

 

I've removed all other fields with the exception of the date range slicer and using the same method as previously described, the DISTINCTCOUNT(Column) gives 9117 rows and the visual filters of the tables with the column has a total of 9085 rows (same as Kusto query result). 

 

There are no report level filters (only using visual level filters and a few page level filters on other pages).

Do you have any joins to other tables? If there is a bridge or lookup table that does not contain all the options it probably is excluding those records that don’t match up.

Tried it on a simple Kusto query and I'm still getting a discrepancy between the distinct count of Power BI and distinct count of Kusto.

 

Kusto Query (returns 3479):

Table_Name
| where Status == "Success" and Timestamp<=todatetime("2018-07-06") and Timestamp>=todatetime("2018-06-07")
| project Column1, Column2
| distinct Column1
| count

 

When creating a new dashboard in Power BI and using the following Kusto query as a data source, 

 

Table_Name
| where Status == "Success" and Timestamp<=todatetime("2018-07-06") and Timestamp>=todatetime("2018-06-07") 
| project Column1, Column2

 If I use a table to visualize Count(Distinct) on Column1, the result is 3465.

 

Tried this on a couple of columns, for one, Power BI shows a higher number than Kusto, for another, it showed a lower number.

 

If I do a count instead of Count(Distinct), the values match.

No clue. I would export the results of the distinct values from each and do a sort and compare in excel and find the differences in order to understand what is going on.  

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.