Python Tales - The Python Tutor Blog

A Python Programming Blog, from a Pythoneer to Pythoneers, created by The Python Tutor.

Saturday, July 29, 2017

Multithreading web scraping with Python

Hello my friends, today we are going to talk about a very trendy topic: web scraping, and how to do it with our beautiful Python programming language. So open your Python IDLE and get set, because in this tutorial entry we are going to code once again.

Web Scraping

First of all, what is web scraping?
Web scraping is a technique in which software extracts meaningful information from a website, in much the same way a human would by exploring the site's source code, but automatically. This automation is achieved by programming all of the steps a human would take: retrieving the website from its URL, walking through its source code, and finding the patterns where the information is kept.
But let's talk a little about all the terms that we need to understand in order to move on with this tutorial.

HTML

HTML, meaning Hypertext Markup Language, is the markup language we need to understand before we are set to scrape a web site. The essence is that all of the structure and information inside a website, a well-coded website at least, is organized internally using recognizable tags that hold relationships with one another.
For this tutorial we are going to scrape a web site that has information about all of the CPUs available on the market, PassMark Software. On the main site there is a table that classifies CPUs by performance, value and hardware target; in the performance section we are going to crawl and scrape the High End CPU Chart page.

This table has a list of approximately 600 CPUs, each with a link to another page with full specifications; let's take the Intel Core i9-7900X for example:

For the simplicity of this tutorial we are going to crawl the high end chart and get, for every CPU, its Socket, Clockspeed and Turbospeed.

Code Tutorial - Walking Through the Code

First we are going to gather all the tools that we need for our script: our libraries.
# web_scraping_cpu.py

'''
    Script for scraping high end CPUs on www.cpubenchmark.net
    https://www.cpubenchmark.net/high_end_cpus.html
'''

import requests
from bs4 import BeautifulSoup
import time

The requests library and its functions make HTTP requests just as a web browser would. Beautiful Soup, our second library, is the one we are going to use for parsing HTML, exploiting the tag relationships so we can easily access the information we need. Finally, the time library will let us measure how long this script takes to gather the information for all of the CPUs. I point this out because we are going to access around 600 pages in sequence, a process that can take several minutes; to reduce this time and improve performance, later on we are going to code a multithreaded web scraping script.
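If you have never used requests before, here is a minimal sketch of a single request against the page we are going to scrape, just to show the shape of a response (status_code and content are standard attributes of the response object):

# Minimal sketch: fetch the chart page and check that the request worked
import requests

response = requests.get('https://www.cpubenchmark.net/high_end_cpus.html')
print(response.status_code)   # 200 means the server answered OK
print(len(response.content))  # the raw HTML bytes we will hand to Beautiful Soup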
First we are going to code a few lines to access the High End Chart table and, for each CPU (each row), retrieve its name, score, price and the URL for its full specs. A practice that I strongly recommend is using the Google Chrome Developer Tools to inspect the source code and find patterns; you can right-click on the chart and choose Inspect from the context menu.

So as we can see, the table holding all of the CPUs is inside a DIV tag identified by the id "mark". This is very important to highlight, because we use CSS selectors in the same way we would use them with jQuery to select elements. Now if we expand the row tags and their data tags we will have this.

Each row holds 3 data tags containing hyperlink tags. We need to point out that every tag has properties: the text property holds the content shown in the web browser, while the properties under the hood, for instance href, hold things like the link target used when a user clicks on the text.
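To make this concrete, here is a tiny self-contained sketch; the HTML string is invented for illustration, not copied from the real page:

from bs4 import BeautifulSoup

html = '<a href="cpu.php?cpu=Intel+Core+i9-7900X">Intel Core i9-7900X</a>'
a = BeautifulSoup(html, 'lxml').find('a')

print(a.text)          # the content shown in the browser: Intel Core i9-7900X
print(a.get('href'))   # the property under the hood: cpu.php?cpu=Intel+Core+i9-7900X

Now let's do some coding.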
# Web crawling

base_url = "https://www.cpubenchmark.net/"
main_url = "https://www.cpubenchmark.net/high_end_cpus.html"

url_get = requests.get(main_url)
lxml_soup = BeautifulSoup(url_get.content, 'lxml')

cpu_rows = lxml_soup.select('#mark > table > tr')
cpu_rows = cpu_rows[1:-1]   # the first and last rows are table headers
cpus = []
 
for row in cpu_rows:
 
    # Page scraping
    
    td1 = row.select('td')[0]
    td2 = row.select('td')[1]
    td3 = row.select('td')[2]
 
    a1 = td1.find('a')
    a3 = td3.find('a')
 
    data = {}
 
    data['name'] = a1.text
    data['url'] = base_url + a1.get('href')
    data['price'] = a3.text
 
    data['score'] = td2.text.strip()
 
    cpus.append(data)

We get our HTTP response in url_get, and we parse its content into a Beautiful Soup object using the lxml parser. There are several parsers we could use, but this one is my recommendation, just trust me ;).
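In case you do not have lxml installed, the parser is just an argument you can swap; here is a small sketch of the usual alternatives, which produce an equivalent soup for well-formed pages (reusing the url_get response from above):

# All three build a soup from the same response; they differ mainly in speed
# and in how forgiving they are with broken markup
lxml_soup = BeautifulSoup(url_get.content, 'lxml')          # fast, needs the lxml package
std_soup = BeautifulSoup(url_get.content, 'html.parser')    # ships with Python, no install
html5_soup = BeautifulSoup(url_get.content, 'html5lib')     # most lenient, needs html5lib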
Now, using a CSS selector with the select method, we gather all of the table rows into a list of Tag objects. We slice this list because the first and last rows are headers for the table. These are the kinds of tiny issues you will find on every web site you scrape; it is important to inspect the source code very carefully.
We initialize a cpus list, which we will use to hold all of the CPU data before we move on to scraping each CPU's full specs.
Using a for loop, we walk through each row using a combination of the select and find methods of Tag objects. Remember this: select takes CSS selectors, find takes tag names and returns the first match, and the get method retrieves the value of a specific property. For each CPU we build a dictionary and append it to the cpus variable.
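If the difference between these methods is still fuzzy, here is a quick side-by-side on a toy snippet (the HTML is invented for the example):

from bs4 import BeautifulSoup

html = '''<table>
<tr><td><a href="first.html">First</a></td><td><a href="second.html">Second</a></td></tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')

links = soup.select('td > a')   # select takes a CSS selector and returns a list of Tags
first = soup.find('a')          # find takes a tag name and returns only the first match

print(len(links))               # 2
print(first.text)               # First
print(first.get('href'))        # first.html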
Some string methods are used to clean the information a little, removing leading and trailing spaces, because we are getting the text exactly as it is shown, and web developers sometimes format it in particular ways. For instance, prices tend to carry the $ sign; in this tutorial we are not going to convert prices to numeric values, we are going to keep them as strings with their $ sign.
Finally we are going to scrape all of the CPUs specs.
# CPUs Scraping
 
tick = time.time()
fail_list = []
 
for cpu in cpus:
 
    cpu_url = cpu['url']
 
    url_get = requests.get(cpu_url)
    lxml_soup = BeautifulSoup(url_get.content, 'lxml')
 
    tables = lxml_soup.select('.desc')
    rows = tables[0].select('tr')
 
    tds = list(rows[1].select('td'))
    cell = list(tds[0].children)
 
    em = cell[0]
    info = list(em.children)
    
    try:
        cpu['Socket'] = info[2].strip()
        cpu['Clockspeed'] = info[5].strip()
        cpu['Turbospeed'] = info[8].strip()
    except IndexError:
        # some CPUs do not publish every spec, so an index can be missing
        fail_list.append(cpu['name'])
 
tock = time.time()

We take the same approach, using combinations of select calls, but now we have used a new property: children. HTML tags have parent-child relationships, and we can take all of a tag's children through the children property. It returns a Python iterator, which we can convert to a Python list so we can index or slice it.
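Here is a small illustration of children on a toy tag (again, invented HTML; the real page mixes text nodes and tags differently, which is why the main script ends up indexing positions 2, 5 and 8):

from bs4 import BeautifulSoup

html = '<em>Socket: LGA2066<br>Clockspeed: 3.3 GHz<br>Turbo Speed: 4.3 GHz</em>'
em = BeautifulSoup(html, 'lxml').find('em')

info = list(em.children)   # children is an iterator; a list lets us index and slice
print(len(info))           # 5: three text nodes with two <br> tags between them
print(info[2])             # Clockspeed: 3.3 GHz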
At the end of the for loop, a try-except clause is used because not every CPU has Turbospeed information available. We need to be very careful: if the data on the website is not entirely standardized, our code could crash with key or index errors. When this happens we need to log somehow what went wrong, so for simplicity I have used a fail_list variable to hold all of the failure cases.
At the end we will have all of the CPU information in the cpus list, each one as a dictionary inside the list. In another entry we are going to save this in a database so you can query some insights from the data. Finally, we print the time it took for this script to completely scrape all of the CPUs.
print("---- {0} seconds ----".format(tock - tick))

On my modest personal laptop (an i3), this took 2583 seconds, approximately forty-three minutes. Some of you might say that was fast, some would say it was a slow process. Either way, we can improve it by using threads to scrape the CPU specs pages concurrently.

Python Multithreading

There are many ways to use the threading module and its functions; in this tutorial we are going to define a Runnable class and target one instance of it from every worker thread. But first, let's talk about how we are going to do this, that is, how we move from sequential execution to concurrent execution. Let's take a look at the next chart.

The first part of the process we already have: take all of the CPUs' basic info and links. But now, instead of using a list, we are going to use a Queue. This is a Python object from the queue library that stores items in order and, most importantly, is made for concurrent access. We are going to create 4 instances of our Runnable process, and each of these threads will take one item at a time from the Queue and scrape that CPU's specs until there are no more elements in the input_cpus variable. After getting the specs of a CPU, each thread puts the retrieved data into an output Queue.
There are some advantages to this approach. The first is that the 4 threads work independently, so it is like opening 4 lines in a grocery store instead of one, making the processing faster. And if there is some failure in the Runnable code, as we mentioned before, that thread will crash but the remaining threads will keep working, so it is a somewhat failure-tolerant process.
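Before we look at the real script, here is the whole handoff pattern in miniature, with a plain function as the thread target and squaring numbers standing in for the scraping work (all names here are invented for the sketch):

import queue
import threading

tasks = queue.Queue()
for number in range(10):
    tasks.put(number)

def worker():
    while True:
        try:
            item = tasks.get(timeout=1)   # raises queue.Empty when nothing is left
        except queue.Empty:
            break
        print(item * item)                # stand-in for the real scraping work

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

The real script does the same thing, but with a callable class instead of a plain function.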
Let's talk about the Runnable Class code.
# multithread_web_scraping.py
'''
    Script for scraping high end CPUs on www.cpubenchmark.net
    https://www.cpubenchmark.net/high_end_cpus.html
    using multiple threads
'''
 
import requests
from bs4 import BeautifulSoup
import time
 
import queue
import threading
 
input_cpus = queue.Queue()
output_cpus = queue.Queue()
fail_cpus = queue.Queue()
 
class Runnable:
 
    def __call__(self):
 
        message = "\nThread {0} working hard"
 
        def process_cpu(cpu):
 
            cpu_url = cpu['url']
 
            url_get = requests.get(cpu_url)
            lxml_soup = BeautifulSoup(url_get.content, 'lxml')
 
            tables = lxml_soup.select('.desc')
            rows = tables[0].select('tr')
 
            tds = list(rows[1].select('td'))
            cell = list(tds[0].children)
 
            em = cell[0]
            info = list(em.children)
            
            try:
                cpu['Socket'] = info[2].strip()
                cpu['Clockspeed'] = info[5].strip()
                cpu['Turbospeed'] = info[8].strip()

                output_cpus.put(cpu)
            except IndexError:
                # some CPUs do not publish every spec, so an index can be missing
                fail_cpus.put(cpu['name'])
 
        while True:

            try:

                cpu = input_cpus.get(timeout=1)

            except queue.Empty:

                break

            print(message.format(id(self)))

            process_cpu(cpu)

As you can see, this is pretty much the same code as the sequential version in terms of accessing the CPU specs, but instead of the for loop used before, here we use an infinite while loop that takes one item at a time from the input_cpus Queue and processes it with the scraping code, wrapped in a function called process_cpu. We use a try-except clause because eventually the Queue will be empty and get will raise queue.Empty, so we catch this and break out of the while loop.
Something very important is that we define this logic in a method called __call__, because when we hand an instance of the class to a thread as its target, __call__ is the function the thread will invoke.
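A minimal sketch of why that works: an instance of a class with __call__ can be invoked like a function, which is exactly what the target argument of threading.Thread expects (the Greeter class is invented for the example):

import threading

class Greeter:

    def __call__(self):
        print("Thread {0} says hello".format(id(self)))

greeter = Greeter()
greeter()   # the instance itself is callable

thread = threading.Thread(target=greeter)   # target just needs a callable
thread.start()
thread.join()

Finally, the remaining code.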
base_url = "https://www.cpubenchmark.net/"
main_url = "https://www.cpubenchmark.net/high_end_cpus.html"
 
# Web crawling
 
url_get = requests.get(main_url)
lxml_soup = BeautifulSoup(url_get.content, 'lxml')
 
cpu_rows = lxml_soup.select('#mark > table > tr')
cpu_rows = cpu_rows[1:-1]
 
for row in cpu_rows:
 
    # Page scraping
    
    td1 = row.select('td')[0]
    td2 = row.select('td')[1]
    td3 = row.select('td')[2]
 
    a1 = td1.find('a')
    a3 = td3.find('a')
 
    data = {}
 
    data['name'] = a1.text
    data['url'] = base_url + a1.get('href')
    data['price'] = a3.text
 
    data['score'] = td2.text.strip()
 
    input_cpus.put(data)
 
# CPUs Scraping
 
tick = time.time()
 
# Threads
 
threads = []
 
for i in range(4):
 
    new_thread = threading.Thread(target=Runnable())
    new_thread.start()
    threads.append(new_thread)
 
for thread in threads:
 
    thread.join()
 
tock = time.time()
 
print("---- {0} seconds ----".format(tock - tick))

Pretty much the same code, but at the end we can see how the threads are instantiated; you will find this pattern in most multithreading tutorials. Finally, you will have the output_cpus Queue holding all of the CPU spec dictionaries.
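The script stops at the timing, but since a Queue is not a list, you will probably want to drain output_cpus before working with the data; here is a small sketch of how that could look (safe here because every thread has already been joined):

# Drain the output queue into a plain list of dictionaries
scraped_cpus = []
while not output_cpus.empty():
    scraped_cpus.append(output_cpus.get())

print("{0} CPUs scraped, {1} failures".format(len(scraped_cpus), fail_cpus.qsize()))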
On my computer this process took 406 seconds, approximately six and a half minutes: more than six times faster than the sequential version. We could improve this even further with multiprocessing instead of multithreading, but that is a subject for another entry.
We have come to the end of this entry; I hope you have found it helpful. Thanks for stopping by, and I hope you enjoyed this as much as I enjoyed writing it. Stop by the comments if you want to discuss it, share it on your social media, and subscribe. Cheers my friends.