Or: how knowledge is power when it is shared
Act 1: on dealing with “busy” data
On inspecting a dataset, it can become clear that the insight might be lost in how tightly packed the data points are. It becomes difficult to separate the signal from the noise. Take this example – rat sightings in New York City.
Some form of aggregation could help, reducing the number of data points. Alternatively, a statistical operation could siphon off the points that do not play a significant role in the relationships under investigation. Both options can be applied to spatial or non-spatial data, though spatial data is the focus of this blog.
Act 2: Smoothing out the visual representation of the data – density tools
A simple form of spatial aggregation with point data (no clue? Read my blog on spatial data here) is to shepherd the points into some form of zone. This could either be one that already exists (administrative area, for example) or one that is created (100m x 100m grid). The density of events then becomes the number of events in each zone divided by the area of the zone.
The overall view is simplified and the complexity in the data appears reduced.
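To make the idea concrete, here is a minimal sketch of density by zone for a regular grid – counting points per cell and dividing by cell area. The coordinates are hypothetical, not the actual rat sightings data:

```python
import numpy as np

def grid_density(x, y, cell_size):
    """Count points in each cell of a regular grid and divide by cell area."""
    # Bin point coordinates into cells of the given size
    ix = np.floor(x / cell_size).astype(int)
    iy = np.floor(y / cell_size).astype(int)
    counts = {}
    for cx, cy in zip(ix, iy):
        counts[(cx, cy)] = counts.get((cx, cy), 0) + 1
    # Density = events in the cell / cell area
    return {cell: n / cell_size**2 for cell, n in counts.items()}

# Nine points clustered near the origin, one outlier
x = np.array([5, 12, 18, 22, 30, 35, 41, 48, 55, 180.0])
y = np.array([8, 15, 11, 25, 31, 28, 44, 50, 52, 190.0])
print(grid_density(x, y, cell_size=100))  # coarse grid: most points share one cell
print(grid_density(x, y, cell_size=50))   # finer grid: the view changes
```

Running the same points through two cell sizes shows the point made below: the grid itself shapes the picture you see.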
However, the outcome is dependent on the underlying nature of the landscape. The size of the grid affects the view. The nature of change within the grid affects the view. Underlying attributes (population, for example) affect the view.
Therefore care should be taken in interpretation. Only general statements should be made – darker cells appear to have more events than lighter coloured cells.
Above are a couple of different grid sizes for the rat sightings data. I created these in Alteryx. To reproduce this view in Tableau a few additional steps are required (see Mark’s blog on this).
To restate this point for clarity – this is not what Tableau is doing when density marks are employed.
Act 3: kernel density estimation
Tableau is not performing a simple density by zone operation. Tableau is performing a type of kernel density estimation (KDE) (Kent Marten, via Twitter).
This is a reasonably complex spatial analytical procedure. It involves quite a lot of mathematics. Please have a look at some of the resources at the end if you’re interested in reading more.
In summary, for point data, it calculates the density of features within a specified neighbourhood around those features. Processing occurs across the entire dataset, allowing the relative “importance” of each feature to be assessed.
Here’s a great description from ESRI (makers of one of the most popular GIS packages globally):
“Conceptually, a smoothly curved surface is fitted over each point. The surface value is highest at the location of the point and diminishes with increasing distance from the point, reaching zero at the search radius distance from the point.”
Two important elements are needed to perform the calculation.
- Bandwidth (the default search radius) – how far around each point do we look for other features?
- Kernel density function (the smoothing “shape”) – how do we define the distance decay (there is more than one way to do this)?
A smaller search radius reduces the size of the neighbourhood in which to look for other features. A larger radius will do the reverse.
Many GIS packages contain tools to help select the most appropriate bandwidth based on the underlying spatial distribution of the points, correcting for spatial outliers.
The variation in kernel functions is shown below. The default linear kernel used by Tableau (Kent Marten, via Twitter) is triangular. GIS packages typically default to quartic for point data – this gives a slightly smoother appearance than the linear kernel.
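As a hedged sketch of the idea (not Tableau’s actual implementation, and with the kernels’ normalising constants omitted for clarity), here is a 2-D KDE where each point’s contribution decays to zero at the bandwidth, with both a triangular and a quartic kernel:

```python
import numpy as np

def triangular(u):
    """Linear decay: 1 at the point, 0 at the search radius."""
    return np.where(np.abs(u) < 1, 1 - np.abs(u), 0.0)

def quartic(u):
    """Quartic (biweight) decay: a smoother shoulder than triangular."""
    return np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)

def kde_2d(points, query, bandwidth, kernel=triangular):
    """Sum a kernel surface over every point; the value at `query` is the density.
    Each point contributes nothing beyond `bandwidth` (the search radius)."""
    d = np.linalg.norm(points - query, axis=1)  # distance from query to each point
    return kernel(d / bandwidth).sum() / bandwidth**2

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.5]])
q = np.array([0.5, 0.2])
print(kde_2d(points, q, bandwidth=1.0, kernel=triangular))
print(kde_2d(points, q, bandwidth=1.0, kernel=quartic))  # smoother kernel
print(kde_2d(points, q, bandwidth=2.0))  # larger bandwidth widens each point's reach
```

Swapping the kernel or the bandwidth changes the surface – which is exactly why these two choices matter.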
Act 4: What am I doing in Tableau?
Within Tableau, the user simply selects the density mark type. There are no options available to set the search radius (or any reference to this term).
However, there are options to change both the colour “intensity” and the mark “size”. These alter the underlying procedure, but the way in which this happens is hidden from the user.
From trial and error, it seems that altering “size” affects the choice of bandwidth – larger size, larger bandwidth. Here is how it looks when I adjust the “size” using the rat sightings data.
Tableau’s explanation of “intensity” suggests that we can identify more “hot spots” by upping the intensity. My take is that it reduces the number of colour bins available, thus increasing the range of data contained in each one. I’ve raised my concerns about the lack of data-driven colour palettes here, and I don’t accept that altering intensity allows us to identify more “hot spots”, even by eyeball.
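A toy illustration of that interpretation (my reading of the behaviour, not documented Tableau internals): with fewer colour bins, each bin spans a wider range of density values, so distinct densities collapse into the same colour.

```python
import numpy as np

densities = np.array([0.1, 0.25, 0.4, 0.55, 0.7, 0.85])

def colour_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width colour bins."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1])  # internal edges only; max maps to top bin

print(colour_bins(densities, 6))  # six bins: each value gets its own colour
print(colour_bins(densities, 3))  # three bins: distinct values merge into shared colours
```

Merging values into fewer colours simplifies the picture, but it does not create new information about where the hot spots are.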
Statistically, hot-spot analysis (Getis-Ord Gi*) identifies significant hot (and cold) spots in the data given a set of weighted input features. Additional functions would need to be available in order to perform this calculation in Tableau.
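To show what that calculation involves, here is a simplified sketch of the Gi* statistic using binary distance weights (real GIS implementations offer more weighting schemes; this is an illustration, not something Tableau provides):

```python
import numpy as np

def getis_ord_gi_star(coords, values, radius):
    """Getis-Ord Gi* z-scores with binary distance weights (simplified sketch).
    Large positive scores flag hot spots; large negative scores flag cold spots."""
    n = len(values)
    xbar = values.mean()
    s = np.sqrt((values**2).mean() - xbar**2)
    scores = np.empty(n)
    for i in range(n):
        # Binary weights: 1 for neighbours within `radius` (including point i itself)
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = (d <= radius).astype(float)
        sw, sw2 = w.sum(), (w**2).sum()
        num = (w * values).sum() - xbar * sw
        den = s * np.sqrt((n * sw2 - sw**2) / (n - 1))
        scores[i] = num / den
    return scores

# Three clustered high values near the origin, two low values far away
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [12.0, 12.0]])
values = np.array([9.0, 8.0, 10.0, 1.0, 2.0])
print(getis_ord_gi_star(coords, values, radius=2.0))
```

The cluster of high values scores positively (a hot spot) while the isolated low values score negatively – the kind of significance testing that a density mark alone cannot give you.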
Back to my concerns about density. None of the decisions about kernels or bandwidth are left up to the user in Tableau. This is problematic.
None of the detail of the processing is clear to the user. This is also problematic.
In order to make appropriate choices, the user needs to understand both their data and the operations they decide to perform. With the density mark type, it doesn’t seem that the second criterion can be fulfilled.
- The Tableau density mark process is not documented with respect to kernel choice, bandwidth, and how these are affected by changes to “intensity” or “size”
- Without proper documentation (and understanding) of the process, Tableau makes it very easy to make arbitrary changes to create a better aesthetic
- The appropriate use of the density mark is actually no different to the appropriate use of other analytical or statistical features. Trend lines, for example. None of these techniques should be applied without a proper understanding of the data and the statistical procedures
Check it out – read more here:
- Kent Marten (Tableau) explains density mark types
- On density and kernels (Geospatial analysis)
- How kernel density works (ESRI)
- KDE from a GIS perspective (ESRI)
- More on interpreting density output from ArcGIS, Eric Krause (ESRI)