08 NOVEMBER 2018

Data Science Skills: Web scraping JavaScript using Python

There are different ways of scraping web pages using Python. In my previous article, I gave an introduction to web scraping using the libraries requests and BeautifulSoup. However, many web pages are dynamic and use JavaScript to load their content. These websites often require a different approach to gather the data.

In this tutorial, I will present several different ways of gathering the content of a web page that contains JavaScript. The techniques used will be the following:

  1. Using Selenium with the Firefox web driver

  2. Using a headless browser with PhantomJS

  3. Making an API call using a REST client or the Python requests library

TL;DR: For examples of scraping JavaScript web pages in Python, you can find the complete code covered in this tutorial over on GitHub.

Update November 7th 2019: Please note that the HTML structure of the webpage being scraped may change over time; this article originally reflected the structure at the time of publication in November 2018. The article has now been updated to run against the current webpage, but this may change again in the future.

First steps

To start the tutorial, I first needed to find a website to scrape. Before proceeding with your own web scraper, it is important to always check the Terms & Conditions and the Privacy Policy of the website you plan to scrape to ensure that you are not breaking any of its terms of use.

Motivation

When trying to find a suitable website to demonstrate, many of the examples I first looked at explicitly stated that web crawlers were prohibited. It wasn’t until I read an article about sugar content in yogurt, and wondered where I could find the latest nutritional information, that another train of thought led me to a suitable kind of website: online supermarkets.

Online retailers often have dynamic web pages that load content using JavaScript, so the aim of this tutorial is to scrape the nutritional information of yogurts from the web page of an online supermarket.

Setting up your environment

Since we will be using some new Python libraries to access the content of the web pages and to handle the data, these libraries will need to be installed using your usual Python package manager, pip. If you don’t already have BeautifulSoup, you will need to install it here too (the package is named beautifulsoup4).

pip install selenium
pip install pandas
pip install beautifulsoup4

To use Selenium as a web driver, there are a few additional requirements:

Firefox

I will be using Firefox as the browser for my web driver, so you will either need to install Firefox to follow this tutorial, or alternatively you can use Chrome with its equivalent driver, ChromeDriver.

Geckodriver

To use the web driver we need to install geckodriver, the driver through which Selenium controls Firefox (whose browser engine is called Gecko). You will need to download geckodriver for your OS, extract the file, and make the executable's location known.

You can do this in several ways:

  1. Move geckodriver to a directory of your choice and define this as the executable path in your Python code (see the example after this list),

  2. Move geckodriver to a directory which is already set as a location for executable files; the list of such directories is known as your PATH environment variable. You can find out which directories are in your PATH as follows:
    • Windows
      Go to:
      Control Panel > Environment Variables > System Variables > Path
    • Mac OSX / Linux
      In your terminal use the command:
       echo $PATH
      
  3. Add the geckodriver location to your PATH environment variable

    • Windows
      Go to:
      Control Panel > Environment Variables > System Variables > Path > Edit
      Add the directory containing geckodriver to this list and save

    • Mac OSX / Linux
      Add a line to your .bash_profile (Mac OSX) or .bashrc (Linux)

        # add geckodriver to your PATH
        export PATH="$PATH:/path/to/your/directory"
      

      Restart your terminal and use the command from step 2 to check that your new path has been added.
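As a preview of option 1, here is a minimal sketch of starting the web driver with an explicit executable path (using the Selenium 3.x API that was current at the time of writing; the path below is a placeholder for your own directory):

# import the selenium web driver
from selenium import webdriver

# placeholder path: replace with the directory you moved geckodriver to
geckodriver_path = '/path/to/your/directory/geckodriver'

# start Firefox via the explicit geckodriver executable path
driver = webdriver.Firefox(executable_path=geckodriver_path)
driver.get('https://groceries.asda.com/search/yogurt')
print(driver.title)
driver.quit()

If geckodriver is already on your PATH (options 2 and 3), webdriver.Firefox() can be called with no arguments.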

PhantomJS

Similar to the steps for geckodriver, we also need to download PhantomJS. Once downloaded, unzip the file and either move it to a directory of your choice or add it to a directory on your PATH, following the same instructions as above.
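As a quick check that PhantomJS is set up correctly, here is a minimal sketch (again using the Selenium 3.x API; note that later Selenium releases deprecate PhantomJS in favour of headless Firefox and Chrome):

# import the selenium web driver
from selenium import webdriver

# with phantomjs on your PATH no arguments are needed;
# otherwise pass executable_path='/path/to/phantomjs' (placeholder path)
driver = webdriver.PhantomJS()
driver.get('https://groceries.asda.com/search/yogurt')
print(driver.title)
driver.quit()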

REST Client

In the final part of this blog, we will make a request to an API using a REST client. I will be using Insomnia, but feel free to use whichever client you prefer!
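To give a flavour of what the requests version looks like, here is a generic sketch; the URL and query parameters below are placeholders for illustration only, since the real API endpoint is identified later in the tutorial:

# import the requests library
import requests

# placeholder endpoint and parameters for illustration only
url = 'https://api.example.com/search'
response = requests.get(url, params={'keyword': 'yogurt'})

# most REST APIs return JSON, which requests can decode directly
print(response.status_code)
data = response.json()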

Scraping the web page using BeautifulSoup

Following the standard steps outlined in my introductory tutorial on web scraping, I have inspected the webpage and want to extract the repeated HTML element:

<div data-cid="XXXX" class="listing category_templates clearfix productListing ">...</div>

As a first step, you might try using BeautifulSoup to extract this information using the following script.

# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the url
urlpage = 'https://groceries.asda.com/search/yogurt'
print(urlpage)
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find product items
# at time of publication, Nov 2018:
# results = soup.find_all('div', attrs={'class': 'listing category_templates clearfix productListing'})

# updated Nov 2019:
results = soup.find_all('div', attrs={'class': 'co-product'})
print('Number of results', len(results))

Unexpectedly, when running the Python script, the number of results returned is 0, even though I can see many results on the web page!

https://groceries.asda.com/search/yogurt
BeautifulSoup - Number of results 0

When further inspecting the page, there are many dynamic features on the web page which suggest that JavaScript is used to present these results. By right-clicking and selecting View Page Source, you can see that there are many <script> elements in the page source.