For this challenge, we needed to parse the city name, state and zip code (if available) out of an address which was in a non-standard format with no commas. I decided to practice using RegEx to help me to complete this challenge, and this is what I came up with.
Here is the input data:
Firstly, we will get the zip code from the rows that have one. For this I used the RegEx tool and the expression: “.+(\d\d\d\d\d)”. This essentially looks through the string until a series of five digits in succession appear which it will then grab. As the zip codes were located at the end of the string, I just placed “.+” at the beginning which means one or more wildcards to cover everything up until the five digits. The output method for this was “Tokenize”, and I split to 1 column and renamed the field with the Select tool so that the result looked like this:
Now it’s time to extract the State name from the string. To do this, I used another RegEx tool with the same Tokenize Output method as before. Then I used the expression “\u\u” which simply checks for two uppercase characters that are next to each other. Make sure that “Case Insensitive” is unchecked! I used a Select tool again to rename the field and I also moved the order of the fields around.
Here is the result after this step:
For the last step in this solution, I wanted to make it easier to locate and extract the city name from the text. To do this, I used another RegEx tool and the Replace Output Method with the expression “Circle|Street|Drive|Road|Avenue|Ave|St|\u\u”. This expression will find any of the words stated (which are separated by | which means “OR”), as well as two uppercase characters that are next to each other (the zip code). I then replaced these with a “,” which resulted in this:
Now it was much easier to write an expression to locate the city name as they are all in between two commas. With another RegEx tool and the Tokenize Output Method again, I used the expression “,\s(.+)\s,”. This locates anything that is in between commas and spaces, which in turn allows us to extract the city name. After that, I used a final Select tool to rename the field to “City” and to get rid of the “Address Text” column.
Here is the final output: