OneWithQuestion
Post Prodigy

Violin Plot: why does the density plot extend past the values?

I'm using the violin plot and I like it a lot.

https://appsource.microsoft.com/en-us/product/power-bi-visuals/WA104381947?tab=Overview 

 

 

However, I see that the density plot extends well beyond the actual range of the data points.

 

That is not something I've experienced with other violin plots, but, perhaps I do not understand the purpose of this?

 

Thanks!

 

 

1 ACCEPTED SOLUTION
dm-p
Super User

Hi there (and thanks for liking the visual!),

 

I got an email from someone with a similar question around the same time as this post and we've discussed offline. I assume you're this person, but I'll fill this out for anyone else who might come across the question and wonder the same thing.

 

Firstly, there's a good post on Stack Overflow that explains the issue. Whilst this question is for the Seaborn library, the concepts still apply.

 

The run-off is due to the Kernel Density Estimation (KDE) used to smooth your distribution. If we just stopped at the min/max of the data, we would run the risk of miscommunicating the modality of your data, so the KDE is projected outwards, based on the trajectory of your data, to a convergence point. Sometimes the KDE doesn't fully resolve to this point due to floating-point issues in JavaScript, so we choose a sensible cut-off point at which to stop. This can produce a straighter line than intended in the tail-off, but it still lets the halves converge (I'm continually looking into this).
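
As a rough illustration of why the curve runs past the data (this is a minimal sketch with a Gaussian kernel, not the visual's actual code; the sample values and the bandwidth of 3.0 are made up), each data point contributes a small bump of non-zero width, so the estimated density is still well above zero at the maximum value and only tails off some distance beyond it:

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates density at x using a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up values, purely for illustration.
samples = [4.2, 5.8, 7.3, 9.7, 10.0, 11.2, 14.5, 16.5, 17.3, 21.5]
bandwidth = 3.0
density = gaussian_kde(samples, bandwidth)

data_max = max(samples)
density_at_max = density(data_max)                       # clearly non-zero
density_one_bw_past_max = density(data_max + bandwidth)  # smaller, still visible
density_far_past_max = density(data_max + 5 * bandwidth) # effectively converged

print(density_at_max, density_one_bw_past_max, density_far_past_max)
```

Cutting the curve off at `data_max` would leave the two halves of the violin open at a visibly non-zero width, which is the shape the projection is trying to avoid.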

 

Some other things to consider (bearing in mind that everyone's data is going to be specific to their individual use cases):

 

  • By default, the visual uses Silverman's rule of thumb to calculate a bandwidth it thinks is reasonable. This is not always great, depending on the modality of your data but gives us somewhere to start when trying to produce a 'one size fits all' visual in these cases.
  • In cases where this doesn't do a great job, it can pay to look at the bandwidth applied. The default setting applies the same bandwidth to all categories based on the entire data set. In 1.1.0 we introduced the ability to apply the bandwidth to individual categories, if you are using them (either using the rule of thumb calculation or allowing you to specify manual values). This will 'tighten-up' each category appropriately, but might not be the right fit for your data as a whole.
  • Beyond the auto-calculation, you can opt to manually specify the bandwidth for the entire visual or on a category-by-category basis. Typically, the lower the bandwidth, the more peaks you'll get in your data.
  • It can pay to take the default bandwidth (you can obtain it by selecting KDE Bandwidth in the Tooltip menu and hovering over the violin) and modify it to see how the plot responds for your data.
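
To make the pooled versus per-category difference concrete, here is a hedged sketch. It assumes one common form of Silverman's rule of thumb, h = 0.9 * min(sd, IQR / 1.34) * n^(-1/5); the exact variant the visual uses is not stated here, so numbers from this sketch should not be expected to match the ones the visual reports. The category names and values are made up.

```python
import statistics


def silverman_bandwidth(samples):
    """One common form of Silverman's rule of thumb (an assumption, see above)."""
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)


# Hypothetical categories with well-separated centres.
categories = {
    "low": [4, 5, 6, 7, 8, 9, 10, 11],
    "mid": [14, 15, 16, 17, 18, 19, 20, 21],
    "high": [24, 25, 26, 27, 28, 29, 30, 31],
}

# One bandwidth from the entire data set, shared by every category.
all_values = [v for values in categories.values() for v in values]
pooled = silverman_bandwidth(all_values)

# A separate bandwidth for each category.
per_category = {
    name: silverman_bandwidth(values) for name, values in categories.items()
}

print(f"pooled: {pooled:.2f}")
for name, h in per_category.items():
    print(f"{name}: {h:.2f}")
```

Because the pooled calculation sees the spread between categories as well as within them, it comes out wider than any of the per-category values for data like this, which is why the per-category option tends to tighten each violin.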

For example, here's the tooth-growth dataset with the default bandwidth across all categories (this gives a bandwidth of 7.9):

 

[Image: tooth-growth violin plot with the default bandwidth applied across all categories]

If I apply the calculation by category instead, it produces bandwidths of 4.8, 5.69 and 4.11 respectively, e.g.:

 

[Image: tooth-growth violin plot with the bandwidth calculated per category]

You can see this looks a little better for this particular use case, but I'd still consider what this might do for a different set of data if I'm splitting into categories.

 

If I really want to tighten-up the chart, I can reduce the bandwidth for all categories to 1, e.g.:

 

[Image: tooth-growth violin plot with a bandwidth of 1 for all categories]

So, my plots converge a little closer to the ends, but it's harder (but not impossible) to discern the modality of each category. For visuals with more data points (these only have 20 or so in them for each category), the plot can get a bit busy and may not serve the story you're trying to tell.
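
The trade-off between a tighter plot and a busier one can be sketched by counting the peaks in the density curve at different bandwidths. This is only an illustration with a Gaussian kernel and made-up values; the helper names `kde_on_grid` and `count_peaks` are hypothetical and not part of the visual.

```python
import math


def kde_on_grid(samples, bandwidth, points=2000):
    """Evaluate a Gaussian KDE on an evenly spaced grid padded by 3 bandwidths."""
    lo = min(samples) - 3 * bandwidth
    hi = max(samples) + 3 * bandwidth
    step = (hi - lo) / (points - 1)
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(
            math.exp(-0.5 * ((lo + i * step - s) / bandwidth) ** 2)
            for s in samples
        )
        for i in range(points)
    ]


def count_peaks(ys):
    """Count interior grid points that are higher than their left neighbour
    and at least as high as their right neighbour."""
    return sum(
        1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] >= ys[i + 1]
    )


# Made-up values: two loose clusters and one straggler.
samples = [1.0, 1.5, 2.0, 2.2, 6.0, 6.4, 7.0, 7.1, 12.0]

peaks_narrow = count_peaks(kde_on_grid(samples, bandwidth=0.2))
peaks_wide = count_peaks(kde_on_grid(samples, bandwidth=5.0))

print("peaks at bandwidth 0.2:", peaks_narrow)
print("peaks at bandwidth 5.0:", peaks_wide)
```

The narrow bandwidth hugs the data range but breaks the curve into several separate peaks, while the wide one smooths them together at the cost of a longer tail past the min/max.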

 

I have considered a 'clamping' option but have chosen not to implement it at this time. I've also had this issue raised today, which I assume has sprung from this post/email discussion. I'll take a look and consider it for a future version as well.

 

Anyway, I hope that this clarifies things a bit and possibly offers some additional options for anyone using the visual.

