Web scraping, i.e. extracting data from websites, is one of the most frequent ways to collect data. This can be done manually (copying and pasting data from a website) or using an automated process. Frequently, web scraping requires access to tools such as Alteryx or knowledge of a programming language such as Python or R, but in this blog I’m going to discuss a couple of quick solutions that allow to import tabular data from web pages by using only Excel or Google Sheets.
In this tutorial I’m going to import the list of companies that comprise the S&P 500 stock index from this Wikipedia page.
Locating the tables in a webpage using its HTML source code
1 – First, let’s identify the tables on our selected webpage. On Firefox or Chrome, press F12 (⌘ + ⌥ + I on Mac) to bring out the Web Developer tools sub-window and select the Inspector tab. It can also be called via the browser’s menu from the “Developer tools” section:
- Firefox: Menu ➤ Web Developer ➤ Toggle Tools, or Tools ➤ Web Developer ➤ Toggle Tools
- Chrome: More tools ➤ Developer tools
- Safari: Develop ➤ Show Web Inspector. If you can’t see the Develop menu, go to Safari ➤ Preferences ➤ Advanced, and check the Show Develop menu in menu bar checkbox.
- Opera: Developer ➤ Developer tools
2 – Now let’s find the <table> tag that defines a table in HTML in the Inspector sub-window (see the highlighted section in the screenshot below). A table header is defined with the <th> tag, each table row is defined with the <tr> tag, and a table cell is defined with the <td> tag. Click the arrow icon in the top left corner of the Inspector to select an element from the page; its location will be highlighted in the Inspector.
The table we will export is highlighted in light blue, note also the highlighted source code next to the cursor in the Inspector to the right.
Importing HTML tables with Google Sheets
1 – Open a new spreadsheet in Google Sheets. In an empty cell enter the following formula:
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", "table", 1)
The formula IMPORTHTML requires 3 inputs:
This is the URL of the page we will import data from. It should include the protocol (e.g. http:// or https://), and be enclosed in quotation marks. Alternatively, it can be a reference to a cell in your workbook that contains the relevant URL.
Google Sheets also gives you the option to import lists from a webpage. Search for a list element (either an unordered list, tag <ul>, or an ordered list, tag <ol>). Each list item starts with the<li> tag.
It can be either “list” or “table”, depending on the type of the webpage’s element that you want to import data from. It should also be enclosed in quotation marks. In this example, we are importing a table.
The index, starting at 1, identifies which table or list should be returned from the page’s HTML source. This is useful if your page contains multiple tables or lists.
2 – Press Enter and enjoy the imported data. This import is dynamic and will update automatically when new data is added to the table. That can be useful when scraping tables that are frequently updated, for example, results of sports competitions or elections.
3 – You can download this newly created data set as .xlsx, .csv. or .tsv for further manipulation, or connect to it directly from Tableau Desktop by selecting “Google Sheets” from the list of servers.
Importing HTML tables with Excel
Works in Excel 2007, 2010 & 2013
1 – Open a blank workbook and go to Data > Get Data from Web.
2 – In the dialogue box paste the URL you would like to access into the “URL” field and click OK.
3 – In the Navigator dialogue box that appears next you will see all tables from the requested page on the left side under the “Display Options” section. You can preview a table in the window to the right by clicking on the relevant table’s name (if there are several tables on the page).
4 – Once you have selected the table you would like to import, click “Load”. The table will appear in the sheet already formatted as a table.
5 – You can get more information about the table’s source and the last update by hovering over the name of the table in the “Queries & Connections” sidebar on the right. To refresh the table’s data, click on the refresh icon next to the table’s name.