Last week, DS12 learned about Regex. Regex is a language used to parse string formatted data. I remember when I first heard of Regex, I was really lost and I couldn’t figure why humankind invented something so shady and incomprehensible. Then a colleague told me about Regex in Google Analytics and I used some basics such as the ‘|’ character for the selection of one word OR another.
But as the class was going on, I understood that Regex is quite powerful in terms of the many different uses it allows. Whether you need to parse, clean, find and replace, create token to sort your data, or build customized Formulas, Regex can help.
Today, I will only have a look at how I used it for my project on last Friday, with the great help of Ollie. For that project, we were asked to update our application dashboard, by adding a new data set to it. I used my presentation about Starbucks and the macro-nutrients split. After considering different options as for what my new data set should be, I copied and paste some HTML code from a website page source. It seemed a good idea at that time, but I soon realized it would have almost been simpler to pick up a PDF file !
My data set is a piece of HTML code, therefore a one-column table, and I want to turn it into a 2 columns data.
Immediately, we can see that some cells are useless because it is only filled with coding bits. I only need the name of the beverage and lowest price available in the list (pic 1) so some cells are redundant.
What is good:
My data set has a pattern: every 6 cells, it comes back to a string data and I want to use those words to fill my first column. All the redundant elements are tags with crocodile brackets or ‘<>’ around. It is then easy to isolate. I then need the digits in a second column and finally, I want to delete the empty cells.
What I did:
First, I did some data cleaning, removing the code tags, I duplicated the digits in a new column then I removed the empty cells (see below).
The Regex Magic
Ollie was really helpful when he came with the idea of creating a row specifically for creating an ID number for each row, that would increment every time the data format of the cell in the first column was NOT a number. (pic 2)
Then, I only had to create an extra step to select only the minimum amount of price – since I only needed the price for the price for the smallest size of beverage.
I was then able to build that two-columns table, and use my set of data in a viz later. Thank you Regex, and thank you Ollie !