Sets and groups and rock and roll

by Gwilym Lockwood

It’s teaching week at The Data School, and I’m doing sets and groups. Quite a lot of it is straightforward, but one useful trick I’ve learned is using combined sets to get around those times when level of detail calculations might not work and/or might be too complicated to keep track of. It involves a few calculated fields, but it’s worth it for the sets appeal.

Since this is all about sets and groups and rock and roll, I’ve downloaded my Spotify listening history (data and workbook available here). It’s a nice, simple dataset; each line is a play of a song, with the band, album, song title, and date/time played. It looks like this:

sets-blog-data

I created a lot of sets based on my favourite bands, the bands that my dad likes, the overlap between those two groups (maybe I’m just getting older, but this was a lot more than I thought), my most played bands and songs in the last year, etc. This is really straightforward, and is explained much better on the Tableau help pages.

But what about other measures? Simply counting the number of songs played by band doesn’t tell me whether I love everything that band has done, or whether I love one song but don’t care about the rest. How do I work out which bands I’ve listened to a fair bit, but mostly just due to one song? The criteria: the band has to have ten or more plays, and one of their songs has to constitute 50% or more of those plays.

I added a couple of calculated fields to make things easier. Some bands have songs with the same song title (e.g. Athletics, Tiny Moving Parts, and We Made God all have a song called “II”), so I created the field SongBand, appending the song title and the band name to create distinct song information. Secondly, I added calculated field “Plays” by simply counting the field Song, and “Plays per Band” to do the same thing but fixed at the band level. These two fields show the same information if you’re just plotting the number of times I’ve played songs by a particular band, but it’s important for later on.

Drag the Band field up to the rows, the Plays field up to the columns, SongBand to the detail shelf, and stick a filter on so that SongBand is >= 10. Then, run a table calculation on Plays to calculate the percent of total across the table. That’ll look like this:

sets-blog-percent-plays

Problem is, we need the table calculation as a stored field for later use. We could just drag the green Plays field from the columns into the measures card, but that’ll lose the levels of detail. So, let’s do it manually by creating a calculated field:

sets-blog-percent-plays-cf

The next thing to do is work out which of these bands (or comedians, since I listen to Doug Stanhope on Spotify a fair bit as well) have one song with at least 50% of that band’s plays. The easiest way of doing this is to create a TRUE/FALSE calculated field to work that out, then drag it onto the view. We can work that out like this:

sets-blog-one-song-band-tf-cf

And dragging it onto the view looks like this:

sets-blog-graph-to-manually-create-set

There we go, there’s four bands that I’ve listened to a fair bit, but where most of those listens have been from one song: American Football (Never Meant, because it’s the epitome of 90s Midwestern emo), Floral, Grayscale, and Wil Wagner (Laika, because I like sad songs about dogs in space).

To create a set of these bands, you can just highlight the four bars in the TRUE part of the graph, right click, and create a set. Done.

Except we’re not done, because it’s not dynamic. Let’s add in my listening data from 2014 and 2015 as well. This adds more bands to the TRUE side of the view, because there are more bands that I’ve listened to ten or more times where at least half of that is just one song…

sets-blog-add-2014-and-15

…but if I drag the previously created set onto the colour shelf, all these bands are coloured as out of the set. The set was manually created based on the 2016 data and will not update itself automatically:

sets-blog-manual-group-doesnt-update

Not so useful.

We can create a dynamically-inclusive set by calculating it, rather than plotting it and manually selecting the things we want from the view. This is quite difficult, because you want to fix the number of plays to be 10 or more at the band level, but use the song level to count the number of plays to work out what percentage that is at the band level.

But luckily, instead of writing a really complicated calculated field, you can create sets and then combine them across levels. Sets bomb.

First, let’s create a set of bands where the bands have at least ten plays. Easy enough; just right click on Band in the dimensions, and create a set based on those criteria:

sets-blog-bands-10-plays

Second, let’s create a set of bands where one song has at least 50% of the plays. Remember that Percent Plays calculated field form earlier? This took the number of plays, fixed at the SongBand level, divided by the number of songs at the band level. To be 50% or more, we want Percent Plays to be >= 0.5. But if we just write that, it’ll only look for bands where all the songs I’ve played from that band are >= 0.5. That’d work if I’ve played two songs from one band an equal number of times, but that’s it. Instead, we need to use a MAX() function. This excludes the other songs from consideration, because if a song has been played 50% or more times, it’s automatically the most played song from that band. So, we can create our set as follows:

sets-blog-bands-max-over-50

Then, all we have to do is combine those two sets and keep the shared values:

sets-blog-one-song-bands-combined-sets

That’s it. We now have a set that’ll dynamically update the members of that set regardless of how much data we add or remove.

sets-groups-one-song-band-filter-works-dynamically

Oh, the joy of sets.