Using Python for Scraping

What is it?

Python is a popular programming language that you can use for many different tasks. In this lesson, you will gain experience using it to scrape data from a website. The lesson has been updated for Python 3.

Why should I learn it?

Python is a server-side language used for data analysis and as the basis for web frameworks like Django. It is helpful to know more than one programming language, so you can choose the right one for your needs.

What can I do with it?

Many of the techniques we have covered previously, such as strings, variables, data types, if statements, and loops, will be applicable in Python, but the syntax for applying them is different.

How long will it take?

Programming is something that you learn over time, through practice. But in a few hours, you will be introduced to the basic elements of Python and applications using it for scraping data from a website.

Getting Started with Scraping

This exercise uses the Python programming language and the BeautifulSoup Python library for pulling data out of HTML pages. We’ll write code in a Python file in a text editor, and then we’ll run the script in the Terminal. You can download Python 3 or update to the latest version of Python 3 from the Python Downloads page.

It is helpful to have knowledge of html and css concepts before you learn to scrape a web page. These are basic concepts that all students who have taken the first few modules of a Web Design class should know.

  • table – the html element for a table on an html page
  • tr – the html element for a table row
  • td – the html element for table data (a cell)
  • div – a section of a Web page. It can be identified further with classes or ids.
  • A basic understanding of how html is styled with attributes and with inline and external css.

You will need to be able to look at html code and identify selectors (ids and classes) that will help the script find the data.

A few things about Python:
  • Python is whitespace sensitive, so indentation matters. When you paste the saved code, make sure it is indented in exactly the same manner.
  • Do not end lines with a semicolon or other punctuation.
  • If you want to comment any lines of the code, precede each line with a # sign. Commenting lets you add helpful instructions or descriptions that do not affect the code's ability to run. It also lets you temporarily remove lines of code for testing (see the short example below).
  • If you get any error messages in the Terminal, read them. They will help you troubleshoot any problems.

These syntactical rules may not make sense right now, but will become clearer as we work through the exercises.
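
For example, here is a short snippet that shows these rules at work. The # lines are comments that Python ignores, and the indented line belongs to the for loop above it:

# Print the name of each html element we will look for later.
for tag in ["table", "tr", "td"]:
    print(tag)  # this line is indented, so it runs inside the loop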

Python Code

Check your Python version. Make sure you have Python installed; we will be using Python 3 (at the time of this tutorial, the current version is Python 3.9.0). You can find the Terminal on a Mac under Applications, Utilities. In the Terminal, type the following command. The $ indicates the Terminal prompt, so you don't need to type that in.

$ python3 --version

Also, make sure you have the pip program installed. This allows you to install the Python libraries you will use during this lesson. Type this command into the Terminal:

$ pip --version

If you don’t have pip, run this to download pip:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Then run this to install it:

$ python3 get-pip.py

Now install the libraries for this lesson with pip. If you get errors running the commands below, you may first need to upgrade pip3:

$ pip3 install --upgrade pip

$ sudo pip3 install beautifulsoup4
$ sudo pip3 install requests

If the Terminal asks for your password, it is your computer password.

With your Terminal open and a new document ready in your text editor, use the Terminal to cd to a folder on your computer. Let's just mkdir from your home directory to make a folder called "scraper", then cd into "scraper".

$ mkdir scraper
$ cd scraper

Now in your editor, save the file in that folder as scrape.py. You will then be in the appropriate directory to run the file in the Terminal. You will use the following command to run the program:

$ python3 scrape.py

Find the code in the Code Sample below and put it in the scrape.py file.

  1. Let’s take a look at what is happening. First, we import the libraries we will need. "import csv" allows us to create a csv file at the end of the script, "import requests" allows us to handle url requests, and BeautifulSoup lets us select html elements.
  2. Next, we establish the url of the site we are scraping. We put it in a variable named "url," so we can use it in the script. The next two lines create variables for the response and html. We will also use html in the script.
  3. The next two lines use BeautifulSoup to access the html (DOM) and make a selection.
  4. The "find" line is the key to this exercise. You have to go into the code and find a CSS selector that identifies where in the code you want to start scraping. The data is usually stored in a table (although it could be in a list). Find a containing element for the table or list and identify either a class or id. In this case, there was a parent div that had a class "mw-content-ltr". The code includes that information as an attr (attribute).
  5. Next we create an empty list named list_of_rows. This will hold all our rows of data until we are ready to write them to a file.
  6. Next, nested for loops go through each row of the table, extract the text from each cell, and append it to the list_of_cells list. That list is then appended to list_of_rows. This happens for every row of the table.
  7. The last few lines write the data from the final list_of_rows list to a file, in this case named film.csv. After you run the script, you will find it in the same folder as your script.
  8. From the terminal, run:
    $ python3 scrape.py
    After a few seconds, the prompt should reappear. Check your folder for the csv and open it in Excel. You have created your first scraper!

Code Sample - Basic Scraper

Include this code in the scrape.py file you created. It will scrape the list of Academy Award-winning films from Wikipedia. You will run this in the Terminal by going to the appropriate directory, using the Terminal commands to change directory (cd). You run it with $ python3 scrape.py.
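
The original code sample is not reproduced in this text, so below is a minimal sketch that follows the numbered steps above. The Wikipedia url is an assumption based on the page named in this lesson; the "mw-content-ltr" class, the list names, and the film.csv output all come from the walkthrough.

# Step 1: import the libraries we need.
import csv

import requests
from bs4 import BeautifulSoup

# Step 2: the url of the site we are scraping. (Assumed address for the
# Wikipedia list of Academy Award-winning films.)
url = "https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films"

# Variables for the response and the html.
response = requests.get(url)
html = response.content

# Step 3: use BeautifulSoup to access the html (DOM).
soup = BeautifulSoup(html, "html.parser")

# Step 4: find the containing element by its class.
content = soup.find("div", attrs={"class": "mw-content-ltr"})

# Step 5: an empty list to hold all the rows of data.
list_of_rows = []

# Step 6: nested loops that visit each row, then each cell in that row.
for row in content.find_all("tr"):
    list_of_cells = []
    for cell in row.find_all("td"):
        list_of_cells.append(cell.text.strip())
    list_of_rows.append(list_of_cells)

# Step 7: write the data to film.csv in the same folder as the script.
with open("film.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)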

Exercise: Modify the scraper above to scrape the contents of this page, a list of the top 100 newspapers in the United States by circulation - https://www.infoplease.com/culture-entertainment/journalism-literature/top-100-newspapers-united-states .

Hint: You have to find a containing selector for the table you wish to scrape and modify the "find" statement. And, you have to change the url. Also change the name of the output file, so you don't overwrite your previous csv.

Also, you can rerun a command in Terminal by using the up arrow/down arrow keys. This allows you to scroll through the last several statements.

Multi-Page Scraper

Now that we have mastered the art of scraping a page, we can move on to automating the ability to scrape an entire site. Go to the Texas Music Office’s Musician Registry site. The site is an app with several pages listing more than 5,000 musicians. Use the app to find the Austin Area musicians, and find the url for the first page.

https://gov.texas.gov/Apps/Music/Directory/talent/all/region/austin/genre/all/p1

Notice that the end of the url includes "p1". If you go down to the end of the page and click on Page 2, you will see the url change at the end to "p2". This is the pattern we will use later to get all the pages.

See the scraper code below, which creates a loop to run through the first page, scrape it for the h2 and li items in the html, and append them to the csv.

Put the code from the Code Sample below into a new python file. You can name it scrape_music.py.

The code is commented using the # to describe each section. This code gets the first page of the Musician's Registry for the Austin Area. Go through the steps to understand how it uses a loop to find all the divs that contain the text.

Code Sample - Multi-Page Scraper

Include this code in the scrape_music.py file. It will scrape all the artists on the first page of the Musician's Registry that are in the Austin Area. You will run this in the Terminal by going to the appropriate directory, using the Terminal commands to change directory (cd). You run it with $ python3 scrape_music.py.
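
The original code sample is not included in this text, so here is a minimal sketch of the single-page scraper described above. The "musician" class on the listing divs is an assumption; inspect the page source and substitute the real selector you find there.

# Import the same libraries as the basic scraper.
import csv

import requests
from bs4 import BeautifulSoup

# The first page of Austin Area musicians in the registry.
url = "https://gov.texas.gov/Apps/Music/Directory/talent/all/region/austin/genre/all/p1"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

list_of_rows = []

# Loop through each div that contains a musician's listing.
# (The "musician" class is a placeholder; use the real class from the page.)
for listing in soup.find_all("div", attrs={"class": "musician"}):
    list_of_cells = []
    # The artist's name is in an h2 element.
    heading = listing.find("h2")
    if heading:
        list_of_cells.append(heading.text.strip())
    # The remaining details are in li items.
    for item in listing.find_all("li"):
        list_of_cells.append(item.text.strip())
    list_of_rows.append(list_of_cells)

# Write this page's results to a csv.
with open("music.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)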

Scrape All Pages

Now we want to program the scraper to look at each page. The code below puts everything in a loop that goes through each page of the site (1-104). Notice that each page has the same url, just a different number. Use the range function to tell it which pages to get. When you run the program, it may take a few minutes to scrape all the items.

This code also uses an if statement to detect when a link is provided in the html and gets the url instead of the text. Notice the change in indentations that are associated with the for loops and if statement.

The output file is opened with the "a" mode (append) instead of the "w" mode (write) used in the previous example. The file is initially created with "w" before the loop, and then appended to each time through the loop.

Code Sample - Scrape All Pages

Include this code in the scrape_music_austin.py file. It will scrape all the pages of artists in the Austin Area. Look at the bottom of the page to see the total number of pages for the search you are doing. Review the comments in the code to understand what is happening at each step.
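
As with the previous samples, the original code is not reproduced here; this is a minimal sketch under the same assumption about the "musician" listing class. It wraps the single-page scraper in a loop over pages 1-104, creates the output file with "w" before the loop, and appends with "a" inside it.

import csv

import requests
from bs4 import BeautifulSoup

# The url is the same for every page; only the number at the end changes.
base_url = "https://gov.texas.gov/Apps/Music/Directory/talent/all/region/austin/genre/all/p"

# Create (or empty) the output file once with "w" before the loop.
open("music_austin.csv", "w", newline="").close()

# range(1, 105) produces the page numbers 1 through 104.
for page in range(1, 105):
    url = base_url + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    list_of_rows = []
    # The "musician" class is a placeholder; use the real class from the page.
    for listing in soup.find_all("div", attrs={"class": "musician"}):
        list_of_cells = []
        heading = listing.find("h2")
        if heading:
            list_of_cells.append(heading.text.strip())
        for item in listing.find_all("li"):
            # If the item contains a link, keep the url instead of the text.
            link = item.find("a")
            if link:
                list_of_cells.append(link.get("href"))
            else:
                list_of_cells.append(item.text.strip())
        list_of_rows.append(list_of_cells)

    # Append ("a") this page's rows to the csv created above.
    with open("music_austin.csv", "a", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerows(list_of_rows)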

Moving On

Now you have a basic understanding of scraping concepts. There are other resources that can help you scrape. One option is the IMPORTHTML function in a Google Sheet. It requires the url, either "table" or "list" (depending on how the data is structured), and a number (to identify which list or table on the page). For example, =IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films", "table", 1) pulls the first table from that page into the sheet. It is a quick and simple way to get data from an html page into a spreadsheet.

The following are Chrome extensions:

  • Scraper Extension - allows you to select an item on a page, and the scraper tries to find similar items. Install the extension and then find Scrape Similar on the context menu (right-click or ctrl-click).
  • Web Scraper Extension - this provides more powerful functionality. Once installed, you will find the Web Scraper tab in the Chrome Developer Tools (Inspect on the context menu). More information is available on the Web Scraper website.

In the next exercise, you will use Python to access the Twitter API.