This week at The Data School, DS22 were given individual projects to collect data using either an API or by Web Scraping. My project this week was to use the Zoopla API to collect data on properties around my chosen location of interest. I decided I wanted to explore properties around my local train station, and whether their distance to the train station had any effect on the price of the property. The workflow below documents each step I took in order to get my desired output.
I first had to register on the website in order to receive an API key. I used a text input to input the web address and the API key, as well as 'page_number=$$' at the end of the string in order to collect data from multiple pages. I find the easiest way to collect data from multiple pages is to use a generate rows tool. This worked in this instance as I knew how many pages I wanted to return. According to the documentation there are 100 listings per page, and my original call returned just under 8500 records. So I generated 85 rows to bring back every page. The formula;
Replace([Base Url], '$$' , ToString([RowCount]))
will replace the dollar sign with the relevant page number.
In the first section of the workflow, I retrieved the data using the Download Tool. Within the Payload tab of the download tool, I specified the postcode I wanted to look at, the 'page_size' as stated in the API documentation, the longitude and latitude of the location I was interested in, and the radius within which I want to view properties (e.g. within a 5 mile radius).
Next, I Parsed the data to get a row id for each property, and values I would later use as the headers for each column. There were 43 rows for each Row ID, many pertained information I wasn't interested in, so I filtered out just the rows I wanted. I noticed in the 'post_town' column, many of the towns were input as 'London' rather than the actual town (Ealing). As I discovered which postcode these were for, I created a formula to replace 'London' with 'Ealing' which was:
IF [outcode] = "W13" THEN Replace([post_town], "London", "Ealing") else [post_town] ENDIF
I then added a spatial tool in order to see the properties on a map.
Within the original data that was retrieved, there were columns I was interested in that I was unable to filter out in the first filter tool because of the structure of the data. As shown in the image below, each property had information about pricing and each occurrence of the price changing. For each property, the amount of price changes was different, and within the price change headers, they had a value of '0, 1, 2' etc. depending on how many times the price changed. This meant, crosstabbing the data would have been inefficient and resulted in several columns, as the price header is not the same for each property.
To be able to use this data for my final analysis, I created a separate flow to draw it out. I filtered out just the price and percentage, and parsed out the value within that header. I decided I wanted to see just the most latest price change, and percentage increase or decrease to the previous price of the property. Therefore, in the summarize tool I grouped by Row ID and found the max value for each row. I then joined the summarize back to the regex tool as the summarize dropped out columns I needed.
With the regex tool I used the replace function and changed names of the headers to 'Price' and 'Percent' to make the strings the same in order to crosstab. The regex formulas I used were:
price_change.(\d+).price and price_change.(\d+).percent
I then crosstabbed and joined this back to the original flow on Row ID. Finally, I output the data as a hyper file.