Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
bvy
Helper V
Helper V

Need Measure to count nearby points on a Scatter

Quite simply I want a measure that counts the number of nearby points (relative to a given point) on a Scatter Plot visual. It could use an arbitrary radius (like 5 units). I would use this to assign color or size to points that are close together. 

 

Points in the red circle below have many nearby neighbors, and so the measure should return a higher number (so I can associate it with a "hot" color).  

bvy_1-1598645905851.png

 

bvy_2-1598645933479.png

 

Is this doable? 

 

 

 

1 ACCEPTED SOLUTION
Greg_Deckler
Super User
Super User

@bvy - Yes, definitely. However, I thought there was a built-in K-means cluster finder

https://docs.microsoft.com/en-us/power-bi/visuals/power-bi-visualization-scatter

 

Anyway, you could probably always do this:

Measure =
  VAR __radius = 5
  VAR __x1 = MAX('Data'[X])
  VAR __y1 = MAX('Data'[Y])
  VAR __id = MAX('Data'[ID])
  VAR __Table = ADDCOLUMNS(FILTER(ALL('Data'),[ID]<>__id),"Distance",SQRT( (__x1 - [X])^2 + (__y1 - [Y])^2) )
RETURN
  COUNTROWS(FILTER(__Table,[Distance]<=__radius))

Warning, this could eat up some processing cycles!!


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

View solution in original post

17 REPLIES 17
daxer-almighty
Solution Sage
Solution Sage

Hi there.

How many points in total do you have? Asking because you've got 2 choices: either you calculate this stuff on the fly (could take quite a time and eat up resources) or you pre-calculate the distances and then just retrieve them.

Hi @daxer-almighty . Can't really pre-calculate because the count is very subject to selected filters. Open to ideas though... 

If, in your dataset, you have N points, then you need a table with 3 columns: StartPoint, EndPoint, Distance. So, you need - depending on how easy the DAX must later be - either N*(N-1) or N*(N-1)/2 rows in the table. You can then connect it to other table(s) and it'll be filtered. You can also leave it disconnected and use DAX to retrieve the points and distances you need. I can't see filtering as being any obstacle to implement this...

@daxer-almighty Okay, well this is interesting and worth a little exploration.

 

The raw, unfiltered dataset of points (coming from SSAS) could potentially have a million rows. In the PBI Scatter visual, though, with filteres and slicers applied, I wouldn't expect more than a few thousand points to be displayed at one time (at most). Since the dataset of distances you're proposing is on the order of N^2, I don't see it as practical to do this in SSAS on the full dataset. So are you thinking of implementing this in memory somewhere using DAX, on the filtered dataset? 

 

Definitely interested in hearing more. 

You can easily make the calculations speedier by just storing the distances of points to their nearest neighbours. You could, for instance, take the 10^6 points and either:

1) pre-calculate the distances to points within a certain radius, or
2) pre-calculate the distances to the 10 or 100 or 1000... nearest points.

How to use this? Easy. Just create DAX that will, for any point, look for an entry in the calculated distances table, and if it doesn't find it there, just calculate it as you normally do. This way you can certainly speed up the calculations. Looking things up - maybe through a relationship - may be faster than calculating. You just have to experiment.

@daxer-almighty 

 

In that case I really don't even need the distance. It's only used to determine if the closeness requirement is met. So two points get a record in the table only if they're sufficiently close. If each point has, on average, 10 neighbors, then that might be manageable to pre-process. Then it would just be a matter of doing a count where the point appears in either column (assuming we're not storing (A, B) and (B, A) in the table).

 

I'm hazy on how to implement this. A separate table in SSAS? Process via an SP in the back end database? DAX? 

 

Thoughts @Greg_Deckler ? 

@bvy - You know, I just had another thought on how to greatly speed up the original method of doing the distances dynamically, you could do this:

 

Measure =
  VAR __radius = 5
  VAR __x1 = MAX('Data'[X])
  VAR __y1 = MAX('Data'[Y])
  VAR __id = MAX('Data'[ID])
  VAR __Table = ADDCOLUMNS(FILTER(ALL('Data'),[X]<=__x1+__radius && [X]>=__x1-__radious && [Y]<=__y1+__radius && [Y]>=__y1-__radius),"Distance",SQRT( (__x1 - [X])^2 + (__y1 - [Y])^2) )
RETURN
  COUNTROWS(FILTER(__Table,[Distance]<=__radius))

 

So basically any point that is going to be within the specified distance by definition has to fall within the square defined by the radius so I *think* that should be a performance gain because you are not doing the distance calculation on every single point every single time. And here is another thought, you could technically get rid of the distance calculation all together and just do:

 

Measure =
VAR __radius = 5
VAR __x1 = MAX('Data'[X])
VAR __y1 = MAX('Data'[Y])
VAR __id = MAX('Data'[ID])
VAR __Table = FILTER(ALL('Data'),[X]<=__x1+__radius && [X]>=__x1-__radius && [Y]<=__y1+__radius && [Y]>=__y1-__radius)
RETURN
COUNTROWS(__Table)

 

Sure, it's not as "scientificky" as using the actual distance between points but it would be fairly accurate, probably within most margin of errors!

 


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

@bvy - OK, I gave this a legitmate shot trying to create a Power Query based upon the same logic as my last reply with essentially "drawing" a box around the point and finding the other points in that box. I am quite certain that this is probably not the most elegant or efficient way of doing it because my Power Query fu is relatively weak compared to someone like @ImkeF or @edhans , who maybe they might want to jump into this thread and help optimize this. Anyway, I attached the PBIX below sig. Tables are Table (4), Table (5) and Query2. 

 

Table (4)
let
    #"Table (4)" = let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("PY7BEQAhCAN74e1D9PDOWhj6b+NMUB/ZSYIOuItKEaOiuLTlKoXUl5sU0rOc1rqJxugP0Qy8tgTyy1kC+YO1BPK888zH3w+K+xq3tr1VcWRnBUb8", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, X = _t, Y = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"X", Int64.Type}, {"Y", Int64.Type}})
in
    #"Changed Type"
in
    #"Table (4)"


Table (5)
let
    Source = #"Table (4)",
    #"Added Custom"=Table.AddColumn(Source,"Matches",each Query2([X],[Y],5)),
    #"Expanded Matches" = Table.ExpandTableColumn(#"Added Custom", "Matches", {"ID", "X", "Y"}, {"Matches.ID", "Matches.X", "Matches.Y"})
in
    #"Expanded Matches"


Query2 (it's a function)
(x as number, y as number, r as number) as table =>
let
    Source = #"Table (4)",
    Table = Table.SelectRows(Source,each [X]<=x+r and [Y]<=y+r and [X]>=x-r and [Y]>=y-r)
in
    Table

 

So, the total number of rows is dependent on how many matches are found. The obvious downside to this approach is that you won't be able to make the radius for matches dynamic based upon user input within the report. Data refresh will be slower, your model will be larger, your visuals should be more performant over a pure DAX solution. 

 

To bring @ImkeF and @edhans up to speed quickly. Lots of points in a scatter chart. Want to control the size of those points based upon how many "close neighbors" there are. I created a pure DAX solution that dynamically finds close points using standard distance formula then made improvements, @daxer-almighty suggested a Power Query approach, this message here tries to combine the two approaches into a single solution. 🙂

 

The attached file, Page 4 compares the various solutions. There is a difference even in the same dataset between the distance and box approaches.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

Hi @Greg_Deckler ,

the only thing I can think of is to optimize the crossjoin in Table (5) with the approach that Chris Webb has posted lately: 

   #"Added Custom"=Table.AddColumn(Source,"Matches",each Query2([X],[Y],5)),

https://blog.crossjoin.co.uk/2020/08/30/optimising-the-performance-of-power-query-merges-in-power-bi... 

 

Don't think that the Table.SelectRows has any potential for optimization.

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

Hmm, read the Chris Webb article. Was actually trying to avoid a full cross join, but maybe it is faster to do the full cross join, do the calculation and filter down versus doing the filtering as part of the join essentially.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

Oh, I might got this one wrong @Greg_Deckler .

But it could be worth a try..

 

Imke Feldmann (The BIccountant)

If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!

How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries

@Greg_Deckler Good stuff guys. Will pick this up in the morning when I'm legitimately "at work" again. Thanks for the brainpower. I like where this is going... 

@bvy - Seems like to implement this in the data model you would have to start with a GENERATE of the original table with itself I guess, which is the N^2. Then you would just calculate the distance for each row. But, if you are going to go down that route, you probably might look at doing it in Power Query with a Merge query and a custom column in Power Query. So, at the end of the day your data load would be slower, your data model bigger and performance in your visual should be improved. The degree to which each of these changes and whether one is better than the other is dependent. Obviously, if you are Pro and this technique blows your data model up beyond the 1GB limit, then that's bad. So, it's really going to come down to your specific set of circumstances.

 

Basically, you are trading data load/refresh speeds and data model size for visual performance/easier calculations.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

If you order the points somehow (does not matter how), then you'd have to check N*(N-1)/2 distances, which still is a lot. Not sure how SQL would cope with this (depends on how powerful the server is) but maybe you could do it in Python instead? I'll have to check myself if that's feasible.
Greg_Deckler
Super User
Super User

@bvy - Yes, definitely. However, I thought there was a built-in K-means cluster finder

https://docs.microsoft.com/en-us/power-bi/visuals/power-bi-visualization-scatter

 

Anyway, you could probably always do this:

Measure =
  VAR __radius = 5
  VAR __x1 = MAX('Data'[X])
  VAR __y1 = MAX('Data'[Y])
  VAR __id = MAX('Data'[ID])
  VAR __Table = ADDCOLUMNS(FILTER(ALL('Data'),[ID]<>__id),"Distance",SQRT( (__x1 - [X])^2 + (__y1 - [Y])^2) )
RETURN
  COUNTROWS(FILTER(__Table,[Distance]<=__radius))

Warning, this could eat up some processing cycles!!


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

@Greg_Deckler  This works pretty well. Thanks so much for taking the time. A few observations/questions: 

 

1. What is the purpose of this filter? 

[ID]<>__id

 When I plug that measure in for Size, the Scatter drops points that have 0 neighbors. 

 

2. Also, if I want to have the measure also respond to the report filters, is that just a matter of changing FILTER(ALL to FILTER(ALLSELECTED? (I'm new to DAX.)

 

3. And if so (#2), will that potentially help performance since the measure is operating on fewer records? (To be honest, performance wasn't bad in my sample app but we'll see how it scales.) 

 

Thanks again. 

@bvy - Well, I didn't actually test the code...

 

The intention of the [ID]<>__id was that it would exclude the current point (itself) from the calculation of distance. Eliminating a single point is not a huge performance boost, but, whatever. Not sure why that wouldn't work though, but whatever.

 

Yes, if you want to preserve slicer selections, use ALLSELECTED versus ALL and yes that should improve performance as well in situations where you have slicer selections.

 

Finally, I believe if you put something like your Category into your Legend that if you then clicked the ellipses on your scatter chart that you would have to option to automatically find clusters and that, I believe, uses K-means.


@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!:
Mastering Power BI 2nd Edition

DAX is easy, CALCULATE makes DAX hard...

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors