Hey, Data People! After being behind with my Alteryx Challenge (to finish all Weekly Challenges by the end of my DS training), I’m finally back on track! I managed to catch up over the weekend and I finished another set of challenges. Well, almost, I’m leaving one or two for when I’m more competent…
As promised, here’s the summary of tools I’d used and some tips for solving Data Cleansing challenges.
Know your challenge
I think I’ll repeat that one quite often… Even though some of the tasks are fairly straightforward and don’t need too much brainpower to figure them out, it’s a good practice to write down what you need to do.
I tend to start by describing what my data set is. This is super helpful if there’s more than one starting set so I don’t need to click on different icons to know where things are. Once that’s done, I organise what to do with the fields. The organisation might take a form of verbal steps, drawings, crossing-out, whatever works and requires the minimum effort. For more complex problems, I’d also write down the functions I expect to use.
It might sound like an extra step before building the flow but it actually gives you something specific to build. So know your challenge to know your flow.
Text To Columns – While doing my Data Cleansing challenges, I’ve found that Text To Columns can split the text also to rows! This was a revelation to me that helped me out quite a bit. If you want separate fields and know how many separate expressions are in the text, select ‘Split to columns’ in the configuration and make sure you selected the right number of columns. In case the number of expressions varies between the rows and/or you prefer to treat them as the same field but different options, select ‘Split to rows’ in configuration.
Extra tip: Whatever you put as delimiters (\s – whitespace, \t – tab, other separators – you can find the list on Functions help web page, and any specific characters) will be used to split the text. ‘2018/03/05 13:00’ with delimiters set to ‘/\s:’ (the order is irrelevant) will be split to ‘2018’, ’03’, ’05’, ’13’, ’00’.
Substring and FindString combination – A few times, I’ve found myself in need of getting a part of a string (text) that started after a certain character combination. To do that I used a combination of Substring and FindString functions.
Substring(text, start, [length]) returns part of the text argument starting at start until the end of the text unless length is specified. The important thing to remember is that the position within the string starts from 0.
FindString(text, target) will return the position of first character of target within text. If the target is not found, the function will return -1.
Let’s say you have a text field ‘Capacity: 5,000’ and you want to get the number part of it. There are quite a few rows in your data and each has a different numerical value. You could count which position is that of the first digit but I’m too lazy for that… Instead, I can use FindString([text field], ‘: ‘) to find where the separator starts. It’s easier to count the length of the separator (2), add it to the returned value from FindString and use it with Substring function:
Substring([text field],FindString([text field],': ')+2)
Apart from the mentioned tricks, I used quite a lot of tools mentioned in my first post. Whenever in doubt, try to search for a specific functionality or have a read through the functions to know more or less what they’re doing. The functions are organised by topic which helps a lot.
My solutions may not be the most optimal options but they work. Feel free to contact me if you need anything.
I was doing some data preparation challenges lately but we have just finished a session on macros as part of our DS training, so I might squeeze in macros before the data prep blog. Stay tuned for more!