Asynchronous Code

These lab tasks focus on how JavaScript is used to asynchronously get and set data on remote servers, often referred to as AJAX calls. They also cover a range of different ways to make asynchronous calls. By completing this lab you will be able to write cleaner, more modular code.

Note that AJAX technically uses XML as the data format, but we will be using JSON, which is easier to parse and is becoming the de facto standard for much data on the web. The principles of asynchronous communication and callback functions are effectively identical for XML and JSON though.

Contents

  1. Async callbacks
  2. JSON data
  3. Modules and callbacks
  4. Nested callbacks
  5. Generators
  6. Promises
  7. Async functions
  8. Screen scraping

1 Async Callbacks

In this first exercise you will be studying the currency.js script to learn about two important concepts. Note that before you use this you will need to register for a new account to obtain a valid API key. This will need to be inserted into the code before running.

  1. How to pass parameters to your program as runtime parameters.
  2. How to run code asynchronously using callbacks.

1.1 Runtime Parameters

When a JavaScript program is invoked from the console, the words of the invocation are available through the process.argv array, each word being stored in a different array index. NodeJS expands the first two entries to absolute paths, so index 0 always contains the path to the node executable and index 1 contains the path to the script being run.

Study the currency.js script carefully. When we run this script we need to pass the currency we want as a runtime parameter like this: node currency GBP.

  1. If the script is invoked correctly there should be 3 elements in the process.argv array. Index 0 contains the path to the node executable, index 1 contains the path to the currency script and index 2 contains the string GBP.
    1. You can see the process.argv array logged to the console.
  2. If the array has fewer than 3 elements we throw an error.
  3. Finally we take the third element (index 2) and convert it to upper case, storing the resulting string in an immutable variable (constant).
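
The three steps above can be sketched as a small function. This is a minimal illustration rather than the contents of currency.js: the function name getCurrencyCode, the error message and the sample array are all hypothetical.

```javascript
// Hypothetical helper: takes an argv-style array and returns the currency code.
const getCurrencyCode = argv => {
  const minLength = 3
  if (argv.length < minLength) {
    throw new Error('missing parameter')
  }
  // index 2 holds the first value typed after the script name
  return argv[2].toUpperCase()
}

try {
  // In a real script you would pass process.argv instead of this sample array.
  const sampleArgv = ['/usr/local/bin/node', '/home/user/currency.js', 'gbp']
  const symbol = getCurrencyCode(sampleArgv)
  console.log(symbol) // prints GBP
} catch (err) {
  console.log(err.message)
}
```

Passing the array in as a parameter, rather than reading process.argv inside the function, makes the check easy to try with different inputs.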

1.2 Callbacks

NodeJS is a single-threaded event loop that processes queued events. This means that if you were to execute a long-running task within a single thread then the process would block. To solve this problem, NodeJS relies on callbacks, which are functions that run after a long-running process has finished. Instead of waiting for the task to finish, the event loop moves on to the next piece of code. When the long-running task has finished, the callback is added to the event loop and run.

Because callbacks are such a fundamental part of NodeJS you need to spend time to make sure you fully understand how they work.
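
A minimal sketch of this behaviour, using the built-in setTimeout() function to stand in for a long-running task (the 100ms delay is an arbitrary choice):

```javascript
const order = []

order.push('start')

// setTimeout stands in for a long-running task such as an API call.
setTimeout(() => {
  order.push('callback')
  console.log(order.join(' > ')) // prints start > end > callback
}, 100)

// This line runs before the callback, even though it appears later in the file.
order.push('end')
```

The event loop does not wait at the setTimeout() call. It carries on to the final line and only runs the callback once the timer has finished.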

  1. The script uses a third-party package called request. To install this, make sure your terminal is pointing to the script directory and install it with the npm command (Node Package Manager) like this: npm install request.
  2. Try running the program with a single currency code: node currency GBP.
  3. Because we want to throw exceptions if something unexpected happens, the code needs to be enclosed in a try-catch block.
  4. Next the URL is created. This string is known as a template literal and is enclosed using backticks instead of quotes. This allows variables to be embedded.
  5. When the request.get() method is called it takes two parameters: the url to call and an anonymous function with three parameters, err, res and body. Note the use of the ECMA arrow function. This callback function will be run once the API call has completed. While the request is in progress the network activity is handled in the background, so the event loop is not blocked.
  6. If the API request fails, the first parameter, err, will be non-null and will contain an Error object. At this point we simply throw an exception and exit.
  7. The res parameter contains the entire response sent back from the server. We don't need this in this example.
  8. The body parameter contains the data returned from the API, and this is what we will be using. It is returned as a string so we use JSON.parse() to turn it into a JavaScript object.
  9. Finally we extract the data we need from the JavaScript object and send it to the console for display. The JSON.stringify() function does the opposite of JSON.parse() in that it turns a JavaScript object into a JSON string. The second parameter can be used to filter the results. The third parameter specifies the indentation to use when formatting.
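
Steps 4 to 9 can be tried without an API key by separating the callback logic from the network call. The sketch below is an assumption-laden illustration: the URL, the handleResponse name and the shape of the sample body are hypothetical and may not match currency.js or the real API.

```javascript
// Build the URL with a template literal (backticks allow embedded variables).
// The exact URL used by currency.js may differ; this one is an assumption.
const base = 'GBP'
const url = `http://api.fixer.io/latest?base=${base}`

// Hypothetical callback body, written as a named function so it can be
// called directly with sample data instead of a live response.
const handleResponse = (err, res, body) => {
  if (err) throw new Error('could not complete request')
  const json = JSON.parse(body)               // JSON string -> JavaScript object
  return JSON.stringify(json.rates, null, 2)  // object -> indented JSON string
}

// Invented sample data, not a real API response.
const sampleBody = '{"base":"GBP","rates":{"USD":1.33}}'
console.log(url)
console.log(handleResponse(null, {}, sampleBody))
```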

1.3 Test Your Knowledge

Let's improve the currency exchange tool. You will need to refer to the API documentation as you work through the tasks.

It's often helpful to see the complete response body. This allows you to see precisely what data was returned. Add a line to print out the entire JSON object. This needs to be immediately after the body was parsed as a JavaScript object.

console.log(JSON.stringify(json, null, 2))

  1. Use Number.prototype.toFixed() to round the number to 2 decimal places.
  2. Use Postman to make an API call to convert £ (GBP) to $ (USD). Take a moment to make sense of the structure of the JSON data.
  3. Modify the output of the script to display the currency conversion in a sensible format: e.g. 1 GBP = 1.33 USD.
  4. Finally, modify your program so that it throws an error if it doesn't recognise one of the currency codes.

2 JSON Data

Most RESTful APIs return their data as a string in JSON format. This format allows primitives, objects and arrays to be converted into a string, passed between systems as text and then converted to the correct JavaScript object at the receiving end.
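
As a short illustration of this round trip, the sketch below converts an object containing primitives, an array and a nested object into a string and back again. The data is invented for the example and is not a real API response.

```javascript
// Sample data invented for illustration.
const place = {
  name: 'example',
  visited: true,
  tags: ['city', 'midlands'],
  location: { lat: 52.4, lng: -1.5 }
}

// Object -> JSON string, ready to be sent between systems as text.
const text = JSON.stringify(place)
console.log(typeof text) // prints string

// JSON string -> JavaScript object at the receiving end.
const copy = JSON.parse(text)
console.log(copy.location.lat) // prints 52.4
console.log(copy.tags[1])      // prints midlands
```

Nested values are reached by chaining property names and array indexes, which is the technique you will need for the more complex data in this exercise.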

In this exercise you will learn how to extract information from complex JSON data.

The first step is to create a Google account and obtain an API key using these detailed instructions.

Note: to complete this section you will need to sign up with a payment card to check you are a valid user. You won't be charged unless you upgrade your account. If you are not happy to do this please skip the rest of this section and go directly to the Modules and Callbacks task.

  1. Run the script by entering node addressFinder 'coventry'. The address you are looking for needs to be enclosed in single quotes. Notice the result (lots of data).
  2. Open the addressFinder.js file and notice that the script requires the process.argv array to contain at least three elements, so the user will need to enter an address to look up.
  3. The third parameter (index 2) contains the address to find.
  4. The API call is made, passing the correct parameters and when it is complete the callback code is executed.
  5. The body parameter string is parsed into a JavaScript Object.
  6. This is then converted back into a formatted JSON string and printed to the console.

2.1 Test Your Knowledge

In this exercise you will be extracting data from the JSON object and displaying it in the console.

  1. Try using a nonsensical address. What data is sent if Google can't resolve the address? Add an if statement to check for this and throw an exception if it is found.
  2. If a match is found, the JSON data will contain the longitude and latitude of the location. Extract this data and display it in a human-readable format: lon: xxx, lat: xxx.
  3. The address_components array contains objects describing the full address. Write code to loop through this array and extract the long_name properties, printing them to the console.
  4. The bounds object contains the geo data defining the top-right and bottom-left of a box that contains the location. Write code to calculate the width and height of the box in degrees.

3 Modules and Callbacks

Let's recap a little about JavaScript functions. Functions are first-class objects of the type Object. This means they can be used just like other objects. You have already seen them stored in other variables.

In the previous examples you passed a function as an argument to another function. By passing a function argument we can execute it when we wish, for example after a network operation to retrieve data. In this context, the function is called a callback function.

In this topic you will learn how to create your own functions that take a callback function argument and how to store these in CommonJS modules, importing them where needed.

Locate the directions/ directory then open and run the index.js script. You will be prompted to enter a start and finish address, and the script will then return the driving distance between them. Test the exception handling by using both valid and invalid data.

  1. The directions module is imported.
  2. The getDistance() function it contains is called:
    1. This takes two string parameters
    2. The third parameter is a callback function
  3. The callback function takes two arguments:
    1. The first should always be an error object. This will be null if no error occurred.
    2. The second argument is the data returned.
  4. Exceptions are handled inside the callback function.
  5. The final line in the script executes before the callback function.
  6. The callback function executes once the data has been retrieved, without blocking the thread.

Open the directions.js file and study it carefully.

  1. The request module is imported.
  2. The getDistance() function is exported.
  3. The third argument is the callback function which has two arguments, the error and the data. This is the recommended callback argument pattern sequence.
  4. The getDistance() function makes an asynchronous call to the request.get() function.
    • By isolating the API call in its own private function we won't need to duplicate this code when we add more functionality (the DRY principle).
  5. Its third parameter is a callback function.
  6. In the callback function we check for a non-null first parameter which would indicate an error has occurred.
    • If there has been an error we call our callback function and pass an Error object as its first parameter.
    • If no error has occurred we return null for the first parameter and the data as the second one.
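
The error-first callback pattern described above can be sketched without the request module. In this minimal illustration setTimeout() simulates the API call, and the distances table and error message are invented; the real directions.js will differ.

```javascript
// Hypothetical data standing in for the API; the distances are invented.
const distances = { 'A to B': 12, 'B to C': 30 }

// Error-first callback pattern: callback(err, data).
// In a CommonJS module this function would be exported, for example with
// module.exports.getDistance = getDistance
const getDistance = (origin, destination, callback) => {
  // setTimeout simulates the delay of an asynchronous API call.
  setTimeout(() => {
    const distance = distances[`${origin} to ${destination}`]
    if (distance === undefined) {
      return callback(new Error('route not found'))
    }
    return callback(null, distance)
  }, 50)
}

getDistance('A', 'B', (err, data) => {
  if (err) return console.log(err.message)
  console.log(`distance: ${data}`) // prints distance: 12
})
console.log('this line runs before the callback')
```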

3.1 Test Your Knowledge

  1. When the script runs, the url used in the API call is printed to the console. Copy this into Postman to see the entire API response body.
  2. Write a second function in your module called getDuration() which should print out how long the journey takes (in minutes).
  3. Write a third function called directions which returns an array of directions (HINT: html_instructions).

4 Nested Callbacks

The code that depends on the result of an asynchronous task has to be placed inside that task's callback. This makes it very challenging to build a script that contains several long-running tasks: you end up nesting callbacks inside callbacks (inside callbacks), which makes the code very difficult to write, debug and read, and very difficult to split into separate functions. This situation is commonly known as Callback Hell.

Open the file nestedCallbacks.js which asks for a base currency code then prints out all the exchange rates against other currencies. Notice that there are four functions defined, three of which include a callback. Our script is designed to capture user input using stdin (needing a callback), identify whether a currency code is valid (requiring a second callback) and then get the currency conversion rates for the specified currency (requiring a third callback).

  1. Notice that the checkValidCurrencyCode() function is called by the callback for the getInput() function and the getData() function is called by the callback for the checkValidCurrencyCode() function.
  2. Each callback takes two parameters as normal. The first contains the error (if any) and this needs to be handled in each callback.
  3. The data from the first callback is needed when calling the third function so needs to be stored in an immutable variable (constant).
  4. The fourth, and final, function does not have a callback.

Callbacks are the simplest possible mechanism for asynchronous code in JavaScript. Unfortunately, raw callbacks sacrifice the control flow, exception handling, and function semantics familiar from synchronous code.
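
The shape of the nesting can be seen in the following sketch. It borrows the function names from nestedCallbacks.js but the bodies are hypothetical: setTimeout() replaces stdin and the API calls, and the exchange rate is invented.

```javascript
// Three simulated async steps, each taking an error-first callback.
const getInput = callback => setTimeout(() => callback(null, 'gbp'), 10)

const checkValidCurrencyCode = (code, callback) => setTimeout(() => {
  const valid = ['GBP', 'USD', 'EUR']
  const upper = code.trim().toUpperCase()
  if (!valid.includes(upper)) return callback(new Error(`invalid code ${upper}`))
  callback(null, upper)
}, 10)

const getData = (base, callback) => setTimeout(() => {
  callback(null, { base, rates: { USD: 1.33 } }) // invented rate
}, 10)

// Each step can only start inside the callback of the previous one,
// and every level has to check for its own error.
const run = done => {
  getInput((err, input) => {
    if (err) return done(err)
    checkValidCurrencyCode(input, (err, base) => {
      if (err) return done(err)
      getData(base, (err, data) => {
        if (err) return done(err)
        done(null, data)
      })
    })
  })
}

run((err, data) => {
  if (err) return console.log(`error: ${err.message}`)
  console.log(JSON.stringify(data))
})
```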

4.1 Test Your Knowledge

The callbacks are already nested 3 deep. To test your knowledge of deeply nested callbacks you are going to create a script that has 6 levels of nested callbacks!

  1. Modify the script to ask for the currency to convert to and display only the one conversion rate.
  2. Instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency.
  3. Use the OpenExchangeRates API to display the full name of the chosen currency.

Even though the script is still simple you are probably already getting in a tangle! Imagine a more complex script with conditions, it would quickly get out of hand and become practically impossible to debug.

Thankfully there are a number of advanced features in NodeJS that are designed to flatten out these callbacks and to treat asynchronous code in a more synchronous manner. These are called Generators, Promises and Async Functions and are described below. Even though you don't technically need to know these, it's worth learning them to keep your code manageable.

5 Generators

Until now we have made certain assumptions about NodeJS functions. One of these is that once a function starts running it will always run to completion before any other code runs. A Generator is a different kind of function that can be paused at any time and resumed later.

In concurrent programming there are two types of concurrency: cooperative, which allows the process to determine when the interruption happens, and preemptive, which allows the process to be interrupted by another process. A Generator is an example of cooperative concurrency and uses the yield keyword to trigger the interruption. Resuming execution requires external control.

The cool feature of Generators is that messages can be passed to and from it.

Start by opening the generators.js file.

  1. The function *main() declaration defines a generator function, which behaves much like a standard function.
  2. At the end of the script we call this to instantiate an iterator object, which is stored in a variable named it. This creates the iterator object but doesn't execute any of the generator's code.
  3. To start running the generator function we call the iterator's .next() method. This runs the generator function up to the first yield keyword.
  4. The yield keyword pauses the generator function and passes control to the getInput() function, passing the parameter as normal.
  5. At the end of the getInput() function the .next() method is called on the it iterator object, which passes control back to the generator function. This then runs until it encounters the next yield keyword...
  6. If an error occurs, the error object is passed to the iterator object's .throw() method (see the checkValidCurrencyCode() function to see this in action).
  7. Errors passed in this way are caught by the catch block in the generator function.

Simply by looking at the generator function you can see how it has completely eliminated the nested callbacks, making the code much easier to read (and debug).
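
The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of generators.js: the function names are borrowed from the text, setTimeout() stands in for stdin and the API call, and the results array exists only so the outcome can be inspected.

```javascript
const results = []
let it // the iterator object, assigned at the end of the script

// Simulated async steps. Each one resumes the generator when it finishes.
const getInput = prompt => {
  setTimeout(() => it.next('gbp'), 10) // the value becomes the result of yield
}

const checkValidCurrencyCode = code => {
  setTimeout(() => {
    const valid = ['GBP', 'USD', 'EUR']
    const upper = code.trim().toUpperCase()
    if (!valid.includes(upper)) {
      it.throw(new Error(`invalid currency code ${upper}`)) // lands in the catch block
    } else {
      it.next(upper)
    }
  }, 10)
}

function* main() {
  try {
    const input = yield getInput('enter base currency')
    const code = yield checkValidCurrencyCode(input)
    results.push(code)
    console.log(`base currency: ${code}`) // prints base currency: GBP
  } catch (err) {
    console.log(`error: ${err.message}`)
  }
}

it = main() // creates the iterator without running any of main()
it.next()   // runs main() up to the first yield
```

Note that the simulated steps call it.next() from inside a timer callback. Calling it while the generator is still running would throw an error, so the generator must already be paused at a yield when it is resumed.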

5.1 Test Your Knowledge

The sample script generators.js has the same functionality as the previous script nestedCallbacks.js and your challenge is to implement the same changes as the previous challenge (repeated below). The good news is that you have already solved a lot of the coding challenges and so you can focus on how to implement it using generators.

  1. Modify the script to ask for the currency to convert to and display only the one conversion rate.
  2. Instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency.
  3. Use the OpenExchangeRates API to display the full name of the chosen currency.

6 Promises

A promise is an object that proxies for the return value or the exception thrown by a function that has to do some asynchronous processing (Kris Kowal).

A promise represents the result of an asynchronous operation. As such it can be in one of three possible states:

  1. pending - the initial state of a promise.
  2. fulfilled - the asynchronous operation was successful.
  3. rejected - the asynchronous operation failed.

6.1 Creating a Promise

Promises are created using the new keyword. The Promise constructor takes a single function as its argument and calls it immediately with two arguments, both of which are functions. Calling the first resolves the promise and calling the second rejects it. Once the appropriate function is called the promise state changes.

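const request = require('request')

// wrap the error-first callback used by request in a promise
const getData = url => new Promise( (resolve, reject) => {
  request(url, (err, res, body) => {
    // return so that resolve() is not also called after a rejection
    if (err) return reject(new Error('invalid API call'))
    resolve(body)
  })
})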

This example creates a Promise that wraps a standard callback used to handle an API call. Notice that there are two possible cases handled here.

  1. If the API call returns an error we set the promise state to rejected.
  2. If the API call succeeds we set the promise state to fulfilled.

As you can see it is simple to wrap any async callbacks in promises but how are these called?

6.2 Consuming a Promise

To use promises we need a mechanism that gets triggered as soon as a promise changes state. A promise includes a then() method which gets called if the state changes to fulfilled and a catch() method that gets called if the state changes to rejected.

const aPromise = getData('http://api.fixer.io/latest?base=GBP')

aPromise.then( data => console.log(data))

aPromise.catch( err => console.error(`error: ${err.message}`) )

In this example we create a new Promise and store it in a variable. It gets executed immediately. The second line calls its then() method which will get executed if the promise state becomes fulfilled (the API call is successful). The parameter will be assigned the value passed when the resolve() function is called in the promise, in this case it will contain the JSON data returned by the API call.

If the state of the promise changes to rejected, the catch() method is called. The parameter will be set to the value passed to the reject() function inside the promise. In this example it will contain an Error object.

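This code can be written in a more concise way by chaining the promise methods. Chaining also means that the catch() handler covers the promise returned by then(). In the unchained version above that second promise has no rejection handler of its own, so a failed API call is also reported by NodeJS as an unhandled rejection.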

getData('http://api.fixer.io/latest?base=GBP')
  .then( data => console.log(data))
  .catch( err => console.error(`error: ${err.message}`))

Because the Promise is executed immediately we don't need to store it in a variable. The .then() and .catch() methods are simply chained onto the promise. This form is much more compact and allows us to chain multiple promises together to solve more complex tasks.
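
To try this pattern without an API key or network connection, the sketch below wraps a simulated callback-style function in a promise and then consumes it. The getRate and getRatePromise names and the exchange rates are hypothetical.

```javascript
// Hypothetical callback-style function; setTimeout simulates an API call.
const getRate = (code, callback) => setTimeout(() => {
  const rates = new Map([['USD', 1.33], ['EUR', 1.17]]) // invented values
  if (!rates.has(code)) return callback(new Error(`unknown currency ${code}`))
  callback(null, rates.get(code))
}, 10)

// Wrap it: an error rejects the promise, a result fulfils it.
const getRatePromise = code => new Promise((resolve, reject) => {
  getRate(code, (err, rate) => {
    if (err) return reject(err)
    resolve(rate)
  })
})

getRatePromise('USD')
  .then(rate => console.log(`rate: ${rate}`)) // prints rate: 1.33
  .catch(err => console.log(`error: ${err.message}`))

getRatePromise('XXX')
  .then(rate => console.log(`rate: ${rate}`))
  .catch(err => console.log(`error: ${err.message}`)) // prints error: unknown currency XXX
```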

6.3 Chaining Promises

The real power of promises comes from their ability to be chained. This allows the results from a promise to be passed to another promise. All you need to do is pass another promise to the then() method.

const getData = url => new Promise( (resolve, reject) => {
  request(url, (err, res, body) => {
    if (err) return reject(new Error('invalid API call'))
    resolve(body)
  })
})

const printObject = data => new Promise( resolve => {
  const indent = 2
  data = JSON.parse(data)
  const str = JSON.stringify(data, null, indent)
  console.log(str)
  resolve()
})

const exit = () => new Promise( () => {
  process.exit()
})

getData('http://api.fixer.io/latest?base=GBP')
  .then( data => printObject(data))
  .then( () => exit())
  .catch(err => console.error(`error: ${err.message}`))
  .then( () => exit())

Notice that we pass the printObject promise to the then() method. The data passed back from the getData promise is passed to the printObject promise.

Because we can chain then() and catch() methods in any order we can add additional steps after the error has been handled. In the example above we want to exit the script whether or not an error has occurred.

Despite the code in the printObject promise being synchronous it is better to wrap this in a promise object to allow the steps to be chained.

If a promise only takes a single parameter and this matches the data passed back when the previous promise fulfills there is a more concise way to write this.

getData('http://api.fixer.io/latest?base=GBP')
  .then(printObject)
  .then(exit)
  .catch(err => console.error(`error: ${err.message}`))
  .then(exit)

There are some situations where you can't simply pass the output from one promise to the input of the next one. Sometimes you need to store data for use further down the promise chain. This can be achieved by storing the data in the this object.

getData('http://api.fixer.io/latest?base=GBP')
  .then( data => this.jsonData = data)
  .then( () => printObject(this.jsonData))
  .then(exit)
  .catch(err => console.error(`error: ${err.message}`))
  .then(exit)

In the example above we store the data returned from the getData promise in the this object. This is then used when we call the printObject promise.

6.4 Test Your Knowledge

Run the promises.js script. Its functionality should be familiar from the currency.js script you worked with in section 1.

Open the promises.js script and study the code carefully. Notice that it defines 5 promises and chains them together. You are going to extend the functionality by defining some additional promises and adding them to the promise chain.

  1. Modify the script to ask for the currency to convert to and display only the one conversion rate.
  2. Instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency.
  3. Use the OpenExchangeRates API to display the full name of the chosen currency.

6.5 Executing Code Concurrently

In the async examples we have seen so far, each async function needs to complete before the next async call is run. The diagram below shows how this looks.

         1      2      3
      ───⬤─────⬤─────⬤

The program flow is:

  1. The first async call getData is executed.
  2. Once this has completed, printObject is executed.
  3. Only when this has completed will the exit step execute.

There are many situations where two steps can run at the same time. This is awkward to build using standard callbacks, because you have to keep track of which tasks have finished yourself, but it is straightforward using promises.

The first stage is to create an array of promises. Typically this is done by looping through an array of data and using this to return an array of promises.

const dataArray = ['USD', 'EUR']
const promiseArray = []
dataArray.forEach( curr => {
  promiseArray.push(new Promise( (resolve, reject) => {
    const url = `http://api.fixer.io/latest?base=GBP&symbols=${curr}`
    request.get(url, (err, res, body) => {
      if (err) return reject(new Error(`could not get conversion rate for ${curr}`))
      resolve(body)
    })
  }))
})

In the example above we loop through the dataArray, creating a new promise object that we push onto our promiseArray.

Once we have an array of promises there are two possible scenarios.

  1. We want all the promises in the array to be fulfilled before continuing the promise chain.
  2. We want one of the promises to be fulfilled but we don't care which one.

6.5.1 Promises All

In the first scenario we want all the promises to be fulfilled before continuing and for this we use the Promise.all() method.

Promise.all(promiseArray)
  .then( results => results.forEach( item => console.log(item)))
  .catch( err => console.log(`error: ${err.message}`))

When the Promise.all() method fulfills it returns an array of results. In the example above we loop through these and print each to the terminal.

6.5.2 Promises Race

The alternative is that as soon as one of the promises in the array settles we want to take its result and continue the promise chain. In this scenario we use Promise.race(). Note that the race is decided by the first promise to settle, whether it is fulfilled or rejected.

Promise.race(promiseArray)
  .then( result => console.log(result))
  .catch( err => console.log(`error: ${err.message}`))

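As you can see, only a single value is returned by Promise.race(). In the example above you won't be able to predict which conversion rate will be returned but you will only get the one. A good application of this would be requesting the same data from several APIs and using whichever responds first. Be aware that if the first promise to settle is rejected then Promise.race() is rejected as well. If you don't know which APIs are working and want the first one that succeeds, use Promise.any() instead, which ignores rejections unless every promise is rejected.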
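
The difference between the two methods can be seen in a sketch that replaces the API calls with timers. The fakeRequest helper and the delays are hypothetical and chosen so that the second request finishes first.

```javascript
// Hypothetical stand-in for an API call that takes `ms` milliseconds.
const fakeRequest = (name, ms) => new Promise(resolve => {
  setTimeout(() => resolve(name), ms)
})

// Both requests start immediately and run at the same time.
const makePromises = () => [fakeRequest('USD', 60), fakeRequest('EUR', 20)]

// Fulfils once every promise has fulfilled; results keep the array order.
const all = Promise.all(makePromises())
all.then(results => console.log(results)) // prints [ 'USD', 'EUR' ]

// Settles as soon as the first promise settles.
const race = Promise.race(makePromises())
race.then(result => console.log(result)) // prints EUR
```

Notice that Promise.all() returns the results in the order of the original array, not the order in which the requests finished.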

7 Async Functions

In the previous sections we have covered the use of generators which allow the use of synchronous-style code to handle async code but the syntax is far from intuitive.

We then looked at the use of promises which allows you to wrap async code as a series of promises which can be chained together and implements exception handling. The price we pay for this is non-intuitive syntax which can become overly complex. Async functions combine the benefits of promises with a clean synchronous-style syntax, avoiding the complex syntax used in promise chains. They are designed to simplify the behaviour of using promises in a synchronous manner.

Whenever we execute a function there is some implicit behaviour we expect. One behaviour is that, once invoked, a function will run until it gets to the end. Async functions break this behaviour: they can pause at any await expression and resume at a later point in the script. This enables us to write asynchronous code that looks and feels synchronous. It can even use standard try-catch exception handling.

  1. We can chain promises together in a cleaner way with full exception handling.
  2. We can substitute a promise with an async function without needing to change any other part of the script.

7.1 Simplifying Promise Chains

Here is a simple example.

const getData = url => new Promise( (resolve, reject) => {
  request(url, (err, res, body) => {
    if (err) return reject(new Error('invalid API call'))
    resolve(body)
  })
})

const printObject = data => new Promise( resolve => {
  console.log(JSON.stringify(JSON.parse(data), null, 2))
  resolve()
})

async function main() {
  try {
    const data = await getData('http://api.fixer.io/latest?base=GBP')
    await printObject(data)
    process.exit()
  } catch (err) {
    console.log(`error: ${err.message}`)
    process.exit()
  }
}
main()

Async functions are declared using the async keyword in the function declaration, and all errors are handled using the standard try-catch block. Because the main block of code needs to be in an async function, this has to be explicitly executed at the end of the script.

The getData() function returns a promise. It is called using the await keyword, which pauses the execution of the main() function until the promise is either fulfilled or rejected. If it is fulfilled, the data returned is stored in the data variable and control moves to the next line. If it is rejected, code execution jumps to the catch block.
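
The pausing behaviour can be observed in a small sketch that needs no network access. The delay helper and the order array are hypothetical additions used only to record what runs when.

```javascript
const order = []

// Hypothetical helper that fulfils after `ms` milliseconds.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

async function main() {
  order.push('before await')
  await delay(50)            // main() pauses here...
  order.push('after await')  // ...and resumes once the promise is fulfilled
  return order
}

const finished = main()
// The rest of the script keeps running while main() is paused.
order.push('script continues')

finished.then(result => console.log(result.join(' > ')))
// prints before await > script continues > after await
```

Only the async function is paused by await. The event loop is never blocked, which is why the line after the call to main() runs before main() has finished.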

7.2 Simplified Promises

The value returned by an async function is implicitly wrapped in a Promise.resolve() and any uncaught error is wrapped in a Promise.reject(). This means that an async function can be substituted for a promise. Let's look at a simple example.

const printObjectPromise = data => new Promise( (resolve) => {
  const indent = 2
  data = JSON.parse(data)
  const str = JSON.stringify(data, null, indent)
  console.log(str)
  resolve()
})

const printObjectAsync = async data => {
  const indent = 2
  data = JSON.parse(data)
  const str = JSON.stringify(data, null, indent)
  console.log(str)
}

Both printObjectPromise and printObjectAsync behave in exactly the same manner. They both return a promise that is fulfilled once the data has been printed, and so can be used in either a promise chain or an async function.
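
This equivalence also covers errors, which the following sketch illustrates. The parsePromise and parseAsync names are hypothetical, and the functions return the parsed object rather than printing it so that the result can be inspected.

```javascript
const parsePromise = data => new Promise(resolve => {
  resolve(JSON.parse(data))
})

const parseAsync = async data => JSON.parse(data)

// Both calls return a promise, so both can be chained or awaited.
parsePromise('{"base":"GBP"}').then(obj => console.log(obj.base)) // prints GBP
parseAsync('{"base":"GBP"}').then(obj => console.log(obj.base))   // prints GBP

// Invalid JSON makes JSON.parse() throw. In both cases the error does not
// escape as an exception; the returned promise is rejected instead.
parsePromise('not json').catch(err => console.log(`promise: ${err.name}`))
parseAsync('not json').catch(err => console.log(`async: ${err.name}`))
```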

7.3 Test Your Knowledge

Run the asyncFunctions.js script. Note that it works in the same way as the previous ones. Open the script and study it carefully.

  1. Modify the script to ask for the currency to convert to and display only the one conversion rate.
  2. Instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency.
  3. Use the OpenExchangeRates API to display the full name of the chosen currency.
  4. Rewrite the printObject promise as an async function.
  5. Rewrite another promise as an async function.

8 Screen Scraping

In the previous tasks we have been working with data that is available via a RESTful API but what do you do if the information you need is only found in human-readable format in an HTML webpage?

In this task you will learn how to extract data from HTML web pages, a technique known as Screen Scraping. This is much harder than using an existing API because:

  • the html won't have semantic information
  • if the website author changes the page your script will need to be rewritten.

Despite these issues sometimes this approach is the only way to get the information you need.

Open the quotes/index.js file and notice that it imports a custom quotes module, the ./ indicates that it is in the current directory. Because the parsing code can get quite complex it is best practice to place this in a custom module.

There is only one function in this module, called getQuotes() which takes two parameters, the author name plus a callback. The callback follows best practice by passing an error as the first parameter followed by the data.

Now open the quotes/quotes.js module. The screen-scraping functionality is in a private function which is referenced by the exported function, this makes it easier to update if the web page layout changes.

If an error occurs the callback is called with an Error as the first parameter. If no error occurs, the callback takes a null first parameter with the data as a second parameter. This pattern is consistent with the built-in NodeJS functions that take a callback.

Run the index.js script and try searching for a valid person (such as Asimov), then copy the URL into the Chrome web browser.

Open the Developer Tools and choose the Elements tab. As you hover over the DOM elements in the Elements tab you will see the content highlighted in the browser window.

Use this to expand the DOM until you can highlight the first quote in the list.

  • Notice that all the quotes are in a <dl> (definition list) tag.
  • Each quote is in an <a> (anchor) tag inside a <dt> (definition term) tag.

In the scraper() function:

  1. The supplied parameters are used to create a unique url. It is absolutely vital that:
    • each resource has a unique URL.
    • the URL for each resource can be calculated based on the supplied parameters.
  2. The url is logged to the console so that it can be pasted into a browser to check for validity.
  3. The request module is used to grab the web page html which is available in the body parameter of the callback.
  4. The cheerio module loads the page DOM into a constant which can be parsed using jQuery syntax.
  5. We then check the DOM for particular elements.
    • If there is a <p> tag containing the text No quotations found we know the search has returned no quotations so an error is returned.
    • The number of quotes is extracted from the DOM and stored as a property of the data object.
    • An empty quotes[] array is added to the data object.
    • The jQuery-style each() method is used to loop through each of the tags of interest.
    • Each quote is then pushed onto the quotes[] array.
  6. Once all the data has been extracted from the DOM and added to the data object this is passed to the callback function.

8.1 Test Your Knowledge

The best way to learn about screen scraping is to have a go. In this task you will be writing a script to search for books based on ISBN numbers and returning useful data.

You will be using the Amazon website. Start by searching for a specific ISBN such as 1449336361, which will give you a URL to parse.

https://www.amazon.co.uk/JS-Next-ECMAScript-6-Aaron-Frost/dp/1449336361/ref=sr_1_1?ie=UTF8&qid=1475398158&sr=8-1&keywords=1449336361

The next step is to remove the unnecessary parts of the URL until you are left with something you can work with. This is a process of trial and error but you need to be able to construct this URL using only the ISBN number.

https://www.amazon.co.uk/dp/1449336361

Have a go at writing a books screen scraper and try to return:

  1. Title
  2. Authors
  3. Description
  4. Price
  5. Rating

Extension Activity

By now you should have decided on the theme for your API.

  1. Identify any existing APIs you can integrate into your assignment. Write a NodeJS script to extract and display appropriate data.
  2. Identify websites that contain useful data and write a NodeJS screen scraper to extract and display useful data (lists of items and details on specific items).

Asynchronous JavaScript

Outcomes

  • Understand and use command line options.
  • Understand and use callbacks to produce asynchronous code.
  • Understand the JSON data format and know how to convert between it and JavaScript objects.
  • Understand screen scraping.

IO is Expensive

Waiting for IO to complete is a big waste of resources.

Three solutions:

  • synchronous
  • processes: Apache
  • threads: Node

NodeJS Threading Model

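  • NodeJS runs your JavaScript in a single thread.
  • JavaScript supports lambdas (anonymous functions), which are used as callbacks.
  • Long-running IO is handled in the background, by the operating system or an internal thread pool, outside the JavaScript thread.
  • When the operation completes, its callback is run on the main thread by the event loop.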

Using Request.

Main methods correspond to HTTP verbs:

request.get(url, callback)
request.put(url, data, callback)
request.post(url, data, callback)
request.del(url, callback)

Be careful: the callback runs later, once the response has arrived, so any code written after the request call executes before the callback does.
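The ordering problem can be seen in a small sketch. Here fakeGet is a hypothetical stand-in for request.get() so that the example runs without a network connection or any installed modules; the URL and the rate value are made up.

```javascript
'use strict'

// hypothetical stand-in for request.get(): calls back later with fake data
function fakeGet(url, callback) {
  setImmediate(() => callback(null, { statusCode: 200 }, '{"rates":{"GBP":0.87}}'))
}

const log = []
let rates // stays undefined until the callback has run

log.push('before request')
fakeGet('http://example.com/latest', (err, res, body) => {
  rates = JSON.parse(body).rates
  log.push('callback ran')
})
log.push('after request') // this line runs BEFORE the callback

console.log(log)   // [ 'before request', 'after request' ]
console.log(rates) // undefined
```

Any code that needs the response data therefore has to live inside the callback, or be called from it.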

Callbacks

A callback is a function that is passed to another function as a parameter.

Functions can be passed around like variables.

A function that accepts a callback is called a higher-order function.

The callback is called (executed) inside the other function.

When we pass a callback function as an argument to another function, we are only passing the function definition.

The function is not executed when it is passed as a parameter.

The containing function can execute the callback at any time.
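A minimal sketch of these points, using made-up names (processNumbers and double are not part of the lab code):

```javascript
'use strict'

// A higher-order function: it receives a callback as a parameter
// and decides if and when to execute it.
function processNumbers(numbers, callback) {
  const results = []
  for (const n of numbers) {
    results.push(callback(n)) // the callback is executed here, not by the caller
  }
  return results
}

const double = n => n * 2

// Pass the function definition itself (no parentheses after double).
const doubled = processNumbers([1, 2, 3], double)
console.log(doubled) // [ 2, 4, 6 ]

// Writing double(2) instead would execute it immediately and pass the number 4,
// so processNumbers would throw a TypeError when it tries to call 4 as a function.
```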

Callbacks are important!

NodeJS runs in a single threaded event loop.

If a long-running operation runs synchronously, the process stops ("blocks") until the operation has finished.

To prevent blocking, long-running activities are started asynchronously and given a callback.

The callback is a function that should be run after the operation is complete.

While it is processing, control is passed back to the main event loop.

Simple GET request with callback:

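'use strict'
const request = require('request')
// the callback runs once the response has arrived
request.get('http://api.fixer.io/latest?symbols=GBP', (err, res, body) => {
  if (err) {
    console.log('could not complete request')
    return // stop here: there is no body to print when the request fails
  }
  console.log(body)
})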

Data Exchange Formats

RESTful APIs send data across the Internet

Needs to be transmitted as text (ASCII/Unicode)

Needs to communicate both the data and its structure.

  • Variables
  • Objects
  • Arrays

Common data exchange formats

  • XML - (Extensible Markup Language)
  • JSON - (JavaScript Object Notation)
  • YAML - (YAML Ain't Markup Language, originally Yet Another Markup Language)
  • CSV - (Comma-Separated Values)

XML Example

<address>
  <org>Coventry University</org>
  <street>4 Gulson Road</street>
  <city>Coventry</city>
  <country>United Kingdom</country>
  <postcode>CV1 5FB</postcode>
</address>

JSON Example

{
  "address": {
    "org": "Coventry University",
    "street": "4 Gulson Road",
    "city": "Coventry",
    "country": "United Kingdom",
    "postcode": "CV1 5FB"
  }
}

YAML Example

address:
  org: "Coventry University"
  street: "4 Gulson Road"
  city: "Coventry"
  country: "United Kingdom"
  postcode: "CV1 5FB"

CSV Example

"org","street","city","country","postcode"
"Coventry University","4 Gulson Road","Coventry","United Kingdom","CV1 5FB"

Why do we prefer the JSON format?

  • Text-based
  • Position independent
  • Lightweight
  • Interoperable with JavaScript Objects

Converting to and from JSON

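const jsObj = {
  firstname: 'John',
  lastname: 'Doe'
}
// convert a JavaScript object to a JSON string
const jsonStr = JSON.stringify(jsObj)
// the third argument pretty-prints the string with 2-space indentation
const jsonStr2 = JSON.stringify(jsObj, null, 2)
// convert a JSON string back into a JavaScript object
const newObj = JSON.parse(jsonStr)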
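The round trip can be checked by printing the intermediate values. The sketch below also shows that JSON.parse throws on invalid input, which matters when the string comes from a remote server; safeParse is a hypothetical helper name, not part of the lab code.

```javascript
'use strict'

const jsObj = { firstname: 'John', lastname: 'Doe' }

const jsonStr = JSON.stringify(jsObj)
console.log(typeof jsonStr) // string
console.log(jsonStr)        // {"firstname":"John","lastname":"Doe"}

const newObj = JSON.parse(jsonStr)
console.log(newObj.lastname) // Doe

// JSON.parse throws a SyntaxError on invalid input such as a trailing comma.
function safeParse(text) {
  try {
    return JSON.parse(text)
  } catch (err) {
    return null
  }
}
console.log(safeParse('{"postcode": "CV1 5FB",}')) // null
```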

Screen Scraping

Sometimes called Data Scraping

Extracting data from a human-readable web page

Why use screen scraping?

Some data is not available through an API

Usually a last resort

Sometimes companies scrape their own websites!

There are some challenges:

  • Complex process
  • Needs deconstructable URLs
  • Success depends on the DOM not changing
  • Most search results are paginated

Deconstructable URLs.

To access search results:

  • Search term needs to be inserted into URL

To access resources:

  • Product ID needs to be inserted into URL.

Here are some examples:

Amazon Book Search URL (javascript)

https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Dstripbooks&field-keywords=javascript
https://www.amazon.co.uk/s/?url=search-alias%3Dstripbooks&field-keywords=javascript

Guardian Bookstore

http://bookshop.theguardian.com/catalogsearch/result/?q=javascript&order=relevance&dir=desc
http://bookshop.theguardian.com/catalogsearch/result/?q=javascript

BBC iPlayer search for history

http://www.bbc.co.uk/iplayer/search?q=history

Accessing resources.

Amazon Books

https://www.amazon.co.uk/JavaScript-Definitive-Guide-Guides/dp/0596805527/ref=sr_1_2?s=books&ie=UTF8&qid=1476384737&sr=1-2&keywords=javascript
https://www.amazon.co.uk/dp/0596805527

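Guardian Bookstore

http://bookshop.theguardian.com/javascript-patterns.html
But the ISBN is 0596806752, which does not appear in the URL, so this URL cannot be constructed from the ISBN alone.

BBC iPlayer

http://www.bbc.co.uk/iplayer/episode/b019c88d/the-grammar-school-a-secret-history-episode-2
http://www.bbc.co.uk/iplayer/episode/b019c88d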
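Once a URL has been reduced to its essential parts, building it becomes a one-line function. The sketch below uses the shortened URL patterns listed above; the function names are made up for illustration, and there is no guarantee that the sites still accept these patterns.

```javascript
'use strict'

// Search results: the search term is inserted into the query string.
const searchUrl = term =>
  `https://www.amazon.co.uk/s/?url=search-alias%3Dstripbooks&field-keywords=${encodeURIComponent(term)}`

// Resources: the product ID (here an ISBN) is inserted into the path.
const bookUrl = isbn => `https://www.amazon.co.uk/dp/${isbn}`

// iPlayer episodes follow the same pattern with a programme ID.
const episodeUrl = id => `http://www.bbc.co.uk/iplayer/episode/${id}`

console.log(searchUrl('javascript'))
console.log(bookUrl('0596805527'))  // https://www.amazon.co.uk/dp/0596805527
console.log(episodeUrl('b019c88d')) // http://www.bbc.co.uk/iplayer/episode/b019c88d
```

The search term is passed through encodeURIComponent so that spaces and punctuation do not break the URL.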

Screen scraping techniques

  • Browse to web page using Google Chrome
  • Open the Developer tools (elements tab)
  • Expand DOM structure and see what content it controls
  • Uniquely identify the data
  • Extract data using jQuery patterns

Module for screen scraper.

Process is messy

Needs updating when page structure changes

Need to isolate in its own module

Keep public interface simple

Public interface:

  • Pass a search string and get a JavaScript array in return
  • Pass a resource identifier and get a JavaScript resource back
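One possible shape for such a module is sketched below. Everything in it is an assumption made for illustration: the createScraper factory, the books.example.com URLs, and the page markup. To keep the sketch runnable without installing anything, the page fetcher is injected as a parameter and a regular expression stands in for the cheerio queries a real scraper would use.

```javascript
'use strict'

// createScraper is a hypothetical factory. fetchPage(url, callback) is
// injected so the page access can be swapped out or faked in tests.
function createScraper(fetchPage) {
  // private: all knowledge of the page structure lives in this helper
  const parseTitles = html => {
    const titles = []
    const pattern = /<h2 class="title">(.*?)<\/h2>/g // assumed markup
    let match
    while ((match = pattern.exec(html)) !== null) titles.push(match[1])
    return titles
  }

  // public interface: two simple functions
  return {
    // pass a search string and get a JavaScript array in return
    search: (query, callback) => {
      const url = `http://books.example.com/search?q=${encodeURIComponent(query)}`
      fetchPage(url, (err, html) => {
        if (err) return callback(err)
        callback(null, parseTitles(html))
      })
    },
    // pass a resource identifier and get a JavaScript object back
    getBook: (isbn, callback) => {
      fetchPage(`http://books.example.com/book/${isbn}`, (err, html) => {
        if (err) return callback(err)
        callback(null, { isbn, title: parseTitles(html)[0] })
      })
    }
  }
}

// usage with a fake page fetcher that always returns the same html
const fakeFetch = (url, callback) =>
  callback(null, '<h2 class="title">JavaScript Patterns</h2><h2 class="title">JS.Next</h2>')
const scraper = createScraper(fakeFetch)
scraper.search('javascript', (err, titles) => console.log(titles))
// [ 'JavaScript Patterns', 'JS.Next' ]
```

When the page structure changes, only parseTitles needs updating; code that calls search() and getBook() is unaffected. In a real project the object returned by createScraper would be assigned to module.exports.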