ADV 03 - Implement a concurrent headline scraper

Intro

A concurrent headline scraper is a program that extracts news headlines from multiple sources at once. It relies on concurrency, which lets several tasks make progress during the same period instead of running strictly one after another.

Task

In this week's lecture there is an example of a concurrent application that reports the size of the data at different URLs. The Python code given below goes to a set of URLs and gets the first 5 headlines back from each. However, it does not do this concurrently. This is the 3rd and final advanced viva task; all 5 standard tasks have already been set, so it is also your final task overall. The task is to implement a concurrent version of the code, which should do the same thing, but faster.

Keep in mind that you can use the lecture examples to achieve this: you can copy from them, import them, or rewrite them.

To do this, you should continue to use concurrent.futures, as well as the Python newspaper module. The major work is in integrating these two things so that they work properly.
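One possible shape for that integration, as a minimal sketch rather than the expected solution: submit one worker call per URL to a ThreadPoolExecutor from concurrent.futures. The names fetch_headlines and get_headlines_concurrently are hypothetical, and fetch_headlines only sleeps and returns placeholder strings so that the sketch runs without network access or newspaper installed. In the real task its body would do the newspaper.build, download and parse work for a single URL.

```python
import concurrent.futures
import time

URLS = ['http://www.foxnews.com/', 'http://www.cnn.com/', 'http://www.bbc.co.uk/']

def fetch_headlines(url, count=5):
    """Hypothetical stand-in for the per-site newspaper work (assumption, not real scraping)."""
    time.sleep(0.1)  # simulates waiting on the network
    return ['headline %d from %s' % (i + 1, url) for i in range(count)]

def get_headlines_concurrently(urls, max_workers=5):
    """Run fetch_headlines for every URL in a thread pool and return {url: [titles]}."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input URLs
        return dict(zip(urls, executor.map(fetch_headlines, urls)))

if __name__ == '__main__':
    for url, titles in get_headlines_concurrently(URLS).items():
        print('\nThe headlines from %s are\n' % url)
        for title in titles:
            print(title)
```

Returning the titles and printing them afterwards, rather than printing inside the worker, keeps the output of different sites from interleaving.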

You should check that the headlines are being retrieved correctly (both number and content).
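A check like this can be automated. The sketch below assumes, unlike the given code (which prints), that your scraper returns a dict mapping each URL to its list of titles. check_headlines is a hypothetical helper name.

```python
def check_headlines(results, expected_count=5):
    """Return a list of problems found in a {url: [titles]} dict; an empty list means OK."""
    problems = []
    for url, titles in results.items():
        # Number check: each site should yield exactly expected_count headlines
        if len(titles) != expected_count:
            problems.append('%s: expected %d headlines, got %d'
                            % (url, expected_count, len(titles)))
        # Content check: each headline should be a non-blank string
        for position, title in enumerate(titles, start=1):
            if not isinstance(title, str) or not title.strip():
                problems.append('%s: headline %d is empty' % (url, position))
    return problems

if __name__ == '__main__':
    sample = {'http://www.bbc.co.uk/': ['One', 'Two', 'Three', 'Four', 'Five']}
    print(check_headlines(sample))
```

Because live news sites change between runs, the concurrent and non-concurrent versions may not return identical titles, so a structural check like this one may be more reliable than comparing the two outputs for exact equality.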

NB, it does not matter if some headlines turn out to be a section heading or other non-news content (which can happen, depending on how the news site has been organised).

You should use timeit (there’s an example in the code given) to time both the non-concurrent and the concurrent version and compare the results. If the concurrent version is working properly, it should be faster than the non-concurrent version. The larger the number of repetitions you pass to timeit, the steadier the average and the clearer the difference should be.
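The comparison could be structured as in the following sketch. It uses a hypothetical slow_task that just sleeps in place of a real download, so the numbers only illustrate the method and say nothing about real sites. The function names are assumptions made for this example.

```python
import concurrent.futures
import time
import timeit

def slow_task(n):
    """Hypothetical stand-in for one download: waits briefly, then returns its input."""
    time.sleep(0.05)
    return n

def run_sequential(items):
    """Process the items one after another."""
    return [slow_task(i) for i in items]

def run_concurrent(items):
    """Process the items in a thread pool, one worker per item."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(items)) as executor:
        return list(executor.map(slow_task, items))

def average_time(func, items, number=3):
    """Average wall-clock time of func(items) over `number` runs."""
    return timeit.timeit(lambda: func(items), number=number) / number

if __name__ == '__main__':
    items = list(range(5))
    print('sequential: %.3f s' % average_time(run_sequential, items))
    print('concurrent: %.3f s' % average_time(run_concurrent, items))
```

Passing a callable to timeit.timeit avoids the setup string used in the given code; both forms are supported by the timeit module.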

It may be useful to look at the documentation on concurrent.futures as well as newspaper.

Instructions

Select the certified Python Codio stack.

Do:

sudo apt update
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade Pillow
pip3 install newspaper3k

Python code

import timeit

import newspaper


def get_headlines():
    """Print five headlines from each news site, one site after another (not concurrent)."""

    URLs = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://www.derspiegel.de/',
            'http://www.bbc.co.uk/',
            'https://theguardian.com',]

    for url in URLs:
        # memoize_articles=False disables caching, so every run sees the full article list
        result = newspaper.build(url, memoize_articles=False)
        print('\nThe headlines from %s are\n' % url)
        # Take the first 5 articles; slicing cannot raise IndexError on a short list
        for art in result.articles[:5]:
            art.download()
            art.parse()
            print(art.title)

if __name__ == '__main__':
    # Average the running time over 2 calls
    elapsed_time = timeit.timeit("get_headlines()", setup="from __main__ import get_headlines", number=2)/2
    print(elapsed_time)

Appendix

concurrent
concurrent with timing
non-concurrent
non-concurrent with timing
concurrent with timing using threads