This worksheet is optional and covers more advanced asynchronous techniques than the previous worksheet.
Note that AJAX originally used XML as the data format, but we will be using JSON, which is easier to parse and is becoming the de facto standard for much of the data on the web. The principles of asynchronous communication and callback functions are effectively identical for XML and JSON.
Before you start this worksheet make sure you have the latest lab materials:
$ git stash
$ git pull origin master
$ git stash pop
If the vi editor window pops open:
- press the Esc key.
- type :wq and press the Enter key.
This chapter covers a wide range of topics associated with running async code. These will greatly improve your knowledge of how asynchronous code works and, should you choose to implement them, will lead to much cleaner code.
- Async callbacks *
- JSON data *
- Modules and callbacks *
- Nested callbacks *
- Generators
- Promises
- Async functions
- Screen scraping
Any section above marked with an asterisk (*) should be considered essential knowledge.
In this first exercise you will be studying the currency.js script to learn about two important concepts.
- How to pass parameters to your program as runtime parameters.
- How to run long-running tasks without blocking the program, using callbacks.
When a JavaScript program is invoked from the console, the words of the invocation are available through the process.argv array, each word being stored in a different array index. Index 0 always contains the absolute path to the node executable and index 1 contains the path to the script being run, so the parameters you pass start at index 2.
Study the currency.js script carefully. When we run this script we need to pass the currency we want as a runtime parameter like this: node currency GBP.
- If the script is invoked correctly there should be 3 indexes in the process.argv array. Index 0 contains the path to the node executable, index 1 contains the path to the currency script and index 2 contains the string GBP.
- If the array is shorter than 3 indexes we throw an error.
- Finally we take the third index and convert it to upper case, storing the resulting string in an immutable variable (constant).
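The steps listed above can be sketched as follows. This is only an illustration: the function name getCurrencyCode and the error message are assumptions rather than code taken from currency.js, and the argument array is passed in as a parameter so the logic can be tried without the command line.

```javascript
'use strict'

// Hypothetical helper (not from currency.js): extracts the currency code
// from an argv-style array.
// argv[0] is the node executable, argv[1] the script, argv[2] the first user parameter.
function getCurrencyCode(argv) {
  const minLength = 3
  if (argv.length < minLength) {
    throw new Error('missing parameter')
  }
  return argv[2].toUpperCase()
}

// Simulates the invocation: node currency gbp
const symbol = getCurrencyCode(['/usr/bin/node', '/home/user/currency.js', 'gbp'])
console.log(symbol)  // GBP
```

In a real script you would pass process.argv to the function instead of the hand-written array.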
NodeJS is a single-threaded event loop that processes queued events. This means that if you were to execute a long-running task within a single thread then the process would block. To solve this problem, NodeJS relies on callbacks, which are functions that run after a long-running process has finished. Instead of waiting for the task to finish, the event loop moves on to the next piece of code. When the long-running task has finished, the callback is added to the event loop and run.
Because callbacks are such a fundamental part of NodeJS you need to spend time to make sure you fully understand how they work.
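The ordering described above can be seen in a small sketch. Here setTimeout is used as a stand-in for a long-running task such as an API call; the order array exists only to record what ran when.

```javascript
'use strict'

const order = []

// setTimeout stands in for a long-running task such as an API call.
setTimeout(() => {
  order.push('callback')
  console.log('2: the callback runs once the task has finished')
}, 100)

// This line runs first because the event loop does not wait for the timer.
order.push('main')
console.log('1: the main script carries on without blocking')
```

Even though the timer is set up first, the final two lines of the script run before the callback does.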
- The script uses a third-party package called request. To install this, make sure your terminal is pointing to the script directory and install it with the npm command (Node Package Manager) like this: npm install request.
- Try running the program with a single currency code: node currency GBP.
- Because we want to throw exceptions if something unexpected happens, the code needs to be enclosed in a try-catch block.
- Next the URL is created. This string is known as a template literal and is enclosed using backticks instead of quotes. This allows variables to be embedded.
- When the request.get() method is called it takes two parameters: the URL to call and an anonymous function with three parameters, err, res and body. Note the use of the ECMAScript arrow function. This callback function will be run once the API call has completed; the network request is handled in the background, so the script is not blocked while it waits.
- If the API request fails, the first parameter, err, will be non-null and will contain an Error object. At this point we simply throw an exception and exit.
- The res parameter contains the entire response sent back from the server; we don't need this in this example.
- The body parameter contains the data returned from the API; this is what we will be using. It is returned as a string so we use JSON.parse() to turn it into a JavaScript object.
- Finally we extract the data we need from the JavaScript object and send it to the console for display. The JSON.stringify() function does the opposite of JSON.parse() in that it turns a JavaScript object into a JSON string. The second parameter can be used to filter the results. The third parameter specifies the indentation to use when formatting.
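The round trip between a JSON string and a JavaScript object can be tried in isolation. In this sketch the body string is hand-written sample data standing in for an API response; it is not real exchange-rate data.

```javascript
'use strict'

// A hand-written string standing in for the body returned by an API (not real data).
const body = '{"base":"EUR","rates":{"GBP":0.9,"USD":1.1}}'

// String -> JavaScript object.
const json = JSON.parse(body)
console.log(json.rates.GBP)

// Second parameter: an array of key names acts as a filter on the output.
const baseOnly = JSON.stringify(json, ['base'])
console.log(baseOnly)

// Third parameter: the number of spaces to indent by.
console.log(JSON.stringify(json, null, 2))
```

Passing null as the second parameter, as in the last line, applies no filter at all.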
Let's improve the currency exchange tool. You will need to refer to the API documentation as you work through the tasks.
It's often helpful to see the complete response body. This allows you to see precisely what data was returned. Add a line to print out the entire JSON object. This needs to be immediately after the body has been parsed into a JavaScript object.
console.log(JSON.stringify(json, null, 2))
- At the moment, all the exchange rates are based on the € (EUR) however the API allows this to be changed by adding a second key to the querystring. Read through the documentation and modify the API call to use £ (GBP) as the base currency.
- Modify the program to take two parameters: the first should be the base currency and the second the currency to convert to.
- Use Number.prototype.toFixed() to round the number to 2 decimal places.
- Use Postman to make an API call to convert £ (GBP) to $ (USD). Take a moment to make sense of the structure of the JSON data.
- Modify the output of the script to display the currency conversion in a sensible format, e.g. 1 GBP = 1.33 USD.
- Finally, modify your program so that it throws an error if it doesn't recognise one of the currency codes.
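Before using toFixed() in the tasks above it is worth checking how it behaves. The rate used in this sketch is an invented value chosen for illustration.

```javascript
'use strict'

// An invented exchange rate used only for illustration.
const rate = 1.3372

// toFixed() rounds (it does not truncate) and returns a string.
const text = rate.toFixed(2)
console.log(text, typeof text)  // 1.34 string

// Convert back with Number() if a numeric value is needed.
const rounded = Number(text)
console.log(rounded + 1)
```

Because the result is a string, adding to it without converting first would concatenate rather than add.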
Most RESTful APIs return their data as a string in JSON format. This format allows primitives, objects and arrays to be converted into a string, passed between systems as text and then converted to the correct JavaScript object at the receiving end.
In this exercise you will learn how to extract information from complex JSON data.
- Run the script by entering node addressFinder 'coventry'. The address you are looking for needs to be enclosed in single quotes. Notice the result (lots of data).
- Open the addressFinder.js file and notice that the script requires at least three entries in the process.argv array. The user will need to enter an address to look up.
- The third parameter (index 2) contains the address to find.
- The API call is made, passing the correct parameters, and when it is complete the callback code is executed.
- The body parameter string is parsed into a JavaScript object.
- This is then converted back into a formatted JSON string and printed to the console.
In this exercise you will be extracting data from the JSON object and displaying it in the console.
- Try using a nonsensical address. What data is sent if Google can't resolve the address? Add an if statement to check for this and throw an exception if it is found.
- If a match is found, the JSON data will contain the longitude and latitude of the location. Extract this data and display it in a human-readable format: lon: xxx, lat: xxx.
- The address_components array contains objects describing the full address. Write code to loop through this array and extract the long_name properties, printing them to the console.
- The bounds object contains the geo data defining the top-right and bottom-left of a box that contains the location. Write code to calculate the width and height of the box in degrees.
Let's recap a little about JavaScript functions. Functions are first-class objects of the type Object. This means they can be used just like other objects. You have already seen them stored in variables.
In the previous examples you passed a function as an argument to another function. By passing a function argument we can execute it when we wish, for example after a network operation to retrieve data. In this context, the function is called a callback function.
In this topic you will learn how to create your own functions that take a callback function argument and how to store these in CommonJS modules, importing them where needed.
Locate the directions/ directory then open and run the index.js script. You will be prompted to enter a start and finish address; the script will return the driving distance between them. Test the exception handling by using both valid and invalid data.
- The directions module is imported.
- The getDistance() function it contains is called:
  - This takes two string parameters.
  - The third parameter is a callback function.
- The callback function takes two arguments:
  - The first should always be an error object; this will be null if no error occurred.
  - The second argument is the data returned.
- Exceptions are handled inside the callback function.
- The final line in the script executes before the callback function.
- The callback function executes once the data has been retrieved, without blocking the thread.
Open the directions.js file and study it carefully.
- The request module is imported.
- The getDistance() function is exported.
- The third argument is the callback function which has two arguments, the error and the data. This is the recommended callback argument sequence.
- The getDistance() function makes an asynchronous call to the request.get() function.
- By isolating the API call in its own private function we won't need to duplicate this code when we add more functionality (the DRY principle).
- Its third parameter is a callback function.
- In the callback function we check for a non-null first parameter, which would indicate an error has occurred.
- If there has been an error we call our callback function and pass an Error object as its first parameter.
- If no error has occurred we pass null for the first parameter and the data as the second one.
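The error-first callback pattern described above can be sketched without any network access. The getLength() function here is a hypothetical example, not part of directions.js, and setTimeout simulates the delay of an API call.

```javascript
'use strict'

// Hypothetical function following the error-first callback pattern.
// setTimeout simulates the delay of a network request.
function getLength(text, callback) {
  setTimeout(() => {
    if (typeof text !== 'string') {
      // Something went wrong: pass an Error as the first argument.
      return callback(new Error('text must be a string'))
    }
    // Success: pass null for the error and the data as the second argument.
    return callback(null, text.length)
  }, 10)
}

getLength('Coventry', (err, data) => {
  if (err) return console.error(err.message)
  console.log(`length: ${data}`)
})
```

To share a function like this between scripts it could be attached to module.exports in its own file and loaded with require(), in the same way the directions module is.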
- When the script runs, the URL used in the API call is printed to the console. Copy this into Postman to see the entire API response body.
- Write a second function in your module called getDuration() which should print out how long the journey takes (in minutes).
- Write a third function called directions which returns an array of directions (HINT: html_instructions).
Because the code that runs after an asynchronous task completes has to be inside its callback, it is very challenging to build a script that contains several long-running tasks. You end up nesting callbacks inside callbacks (inside callbacks), which makes the code very difficult to write, debug and read, and means it's very difficult to split into separate functions. This situation is commonly known as Callback Hell.
Open the file nestedCallbacks.js, which asks for a base currency code then prints out all the exchange rates against other currencies. Notice that there are four functions defined, three of which include a callback. Our script is designed to capture user input using stdin (needing a callback), identify whether a currency code is valid (requiring a second callback) and then get the currency conversion rates for the specified currency (requiring a third callback).
- Notice that the checkValidCurrencyCode() function is called by the callback for the getInput() function and the getData() function is called by the callback for the checkValidCurrencyCode() function.
- Each callback takes two parameters as normal. The first contains the error (if any) and this needs to be handled in each callback.
- The data from the first callback is needed when calling the third function so needs to be stored in an immutable variable (constant).
- The fourth, and final, function does not have a callback.
Callbacks are the simplest possible mechanism for asynchronous code in JavaScript. Unfortunately, raw callbacks sacrifice the control flow, exception handling, and function semantics familiar from synchronous code.
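The shape of the problem can be seen in a reduced sketch. The three functions below are hypothetical stand-ins for the steps in nestedCallbacks.js, each simulated with setTimeout rather than real input or API calls.

```javascript
'use strict'

// Three hypothetical async steps, each simulated with setTimeout.
const getInput = callback => setTimeout(() => callback(null, 'gbp'), 10)

const checkCode = (code, callback) => setTimeout(() => {
  if (code.length !== 3) return callback(new Error('invalid code'))
  return callback(null, code.toUpperCase())
}, 10)

const getRates = (code, callback) => setTimeout(() => callback(null, `rates for ${code}`), 10)

// Each step can only start inside the callback of the previous one,
// so the code drifts to the right and the error check is repeated at every level.
getInput((err, input) => {
  if (err) return console.error(err.message)
  checkCode(input, (err, code) => {
    if (err) return console.error(err.message)
    getRates(code, (err, data) => {
      if (err) return console.error(err.message)
      console.log(data)
    })
  })
})
```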
The callbacks are already nested 3 deep. To test your knowledge of deeply nested callbacks you are going to create a script that has 6 levels of nested callbacks!
- modify the script to ask for the currency to convert to and display only the one conversion rate.
- instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency
- use the OpenExchangeRates API to display the full name of the chosen currency.
Even though the script is still simple you are probably already getting in a tangle! Imagine a more complex script with conditions, it would quickly get out of hand and become practically impossible to debug.
Thankfully there are a number of advanced features in NodeJS that are designed to flatten out these callbacks and to treat asynchronous code in a more synchronous manner. These are called Generators, Promises and Async Functions and are described below. Even though you don't technically need to know these, it's worth learning them to keep your code manageable.
Until now we have made certain assumptions about NodeJS functions. One of these is that once a function starts running it will always run to completion before any other code runs. A Generator is a different kind of function that can pause itself part-way through and be resumed later.
In concurrent programming there are two types of concurrency: cooperative, which allows the process to determine when the interruption happens, and preemptive, which allows the process to be interrupted by another process. A Generator is an example of cooperative concurrency and uses the yield keyword to trigger the interruption. To resume execution requires external control.
The cool feature of Generators is that messages can be passed to and from them.
Start by opening the generators.js file.
- The function *main() declaration defines a generator function, which looks much like a standard function.
- At the end of the script we call this to create an iterator object we are calling it. This creates the iterator object but doesn't execute any of the generator's code.
- To start running the generator function we call the iterator's .next() method; this runs the generator function up to the first yield keyword.
- The yield keyword pauses the generator function; the getInput() function on its right is called first, with the parameter passed as normal.
- At the end of the getInput() function the .next() method is called on the it iterator object, which passes control back to the generator function, which runs until it encounters the next yield keyword...
- If an error occurs, the error object is passed to the iterator object's .throw() method (see the checkValidCurrencyCode() function to see this in action).
- Errors passed in this way are caught by the catch block in the generator function.
Simply by looking at the function generator you can see how it has completely eliminated the nested callbacks, making the code much easier to read (and debug).
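The same mechanism can be tried in a cut-down form. The double() function and the results array below are hypothetical and exist only for this sketch; setTimeout stands in for an asynchronous task, and the iterator is named it to match the description above.

```javascript
'use strict'

const results = []

// Hypothetical async step: doubles a number after a short delay,
// then hands the result back to the paused generator.
function double(n) {
  setTimeout(() => it.next(n * 2), 10)
}

function* main() {
  try {
    const first = yield double(2)       // pauses here until it.next(4) is called
    const second = yield double(first)  // pauses here until it.next(8) is called
    results.push(first, second)
    console.log(first, second)
  } catch (err) {
    // Errors passed to it.throw() would be caught here.
    console.error(err.message)
  }
}

const it = main()  // creates the iterator without running any code
it.next()          // runs the generator up to the first yield
```

The value passed to it.next() becomes the result of the yield expression, which is how data gets back into the generator.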
The sample script generators.js
has the same functionality as the previous script nestedCallbacks.js
and your challenge is to implement the same changes as the previous challenge (repeated below). The good news is that you have already solved a lot of the coding challenges and so you can focus on how to implement it using generators.
- modify the script to ask for the currency to convert to and display only the one conversion rate.
- instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency
- use the OpenExchangeRates API to display the full name of the chosen currency
A promise is an object that proxies for the return value or the exception thrown by a function that has to do some asynchronous processing (Kris Kowal).
A promise represents the result of an asynchronous operation. As such it can be in one of three possible states:
- pending - the initial state of a promise.
- fulfilled - the asynchronous operation was successful.
- rejected - the asynchronous operation failed.
Promises are created using the new keyword with the Promise constructor. The constructor takes a function which is called immediately with two arguments, both of which are functions. Calling the first resolves the promise and calling the second rejects it. Once the appropriate function is called the promise state changes.
const getData = url => new Promise( (resolve, reject) => {
request(url, (err, res, body) => {
if (err) reject(new Error('invalid API call'))
resolve(body)
})
})
This example creates a Promise
that wraps a standard callback used to handle an API call. Notice that there are two possible cases handled here.
- If the API call throws an error we set the promise state to rejected.
- If the API call succeeds we set the promise state to fulfilled.
As you can see, it is simple to wrap any async callback in a promise, but how are these called?
To use promises we need a mechanism that gets triggered as soon as a promise changes state. A promise includes a then()
method which gets called if the state changes to fulfilled and a catch()
method that gets called if the state changes to rejected.
const aPromise = getData('http://api.fixer.io/latest?base=GBP')
aPromise.then( data => console.log(data))
aPromise.catch( err => console.error(`error: ${err.message}`) )
In this example we create a new Promise and store it in a variable. It gets executed immediately. The second line calls its then() method, which will get executed if the promise state becomes fulfilled (the API call is successful). The parameter will be assigned the value passed when the resolve() function is called in the promise; in this case it will contain the JSON data returned by the API call.
If the state of the promise changes to rejected, the catch()
method is called. The parameter will be set to the value passed to the reject()
function inside the promise. In this example it will contain an Error
object.
This code can be written in a more concise way by chaining the promise methods.
getData('http://api.fixer.io/latest?base=GBP')
.then( data => console.log(data))
.catch( err => console.error(`error: ${err.message}`))
Because the Promise is executed immediately we don't need to store it in a variable. The .then()
and .catch()
methods are simply chained onto the promise. This form is much more compact and allows us to chain multiple promises together to solve more complex tasks.
The real power of promises comes from their ability to be chained. This allows the results from a promise to be passed to another promise. All you need to do is pass a function that returns another promise to the then() method.
const getData = url => new Promise( (resolve, reject) => {
request(url, (err, res, body) => {
if (err) reject(new Error('invalid API call'))
resolve(body)
})
})
const printObject = data => new Promise( resolve => {
const indent = 2
data = JSON.parse(data)
const str = JSON.stringify(data, null, indent)
console.log(str)
resolve()
})
const exit = () => new Promise( () => {
process.exit()
})
getData('http://api.fixer.io/latest?base=GBP')
.then( data => printObject(data))
.then( () => exit())
.catch(err => console.error(`error: ${err.message}`))
.then( () => exit())
Notice that we pass the printObject
promise to the then()
method. The data passed back from the getData
promise is passed to the printObject
promise.
Because we can chain then()
and catch()
methods in any order we can add additional steps after the error has been handled. In the example above we want to exit the script whether or not an error has occurred.
Despite the code in the printObject
promise being synchronous it is better to wrap this in a promise object to allow the steps to be chained.
If a promise only takes a single parameter and this matches the data passed back when the previous promise fulfills there is a more concise way to write this.
getData('http://api.fixer.io/latest?base=GBP')
.then(printObject)
.then(exit)
.catch(err => console.error(`error: ${err.message}`))
.then(exit)
There are some situations where you can't simply pass the output from one promise to the input of the next one. Sometimes you need to store data for use further down the promise chain. This can be achieved by storing the data in the this
object.
getData('http://api.fixer.io/latest?base=GBP')
.then( data => this.jsonData = data)
.then( () => printObject(this.jsonData))
.then(exit)
.catch(err => console.error(`error: ${err.message}`))
.then(exit)
In the example above we store the data returned from the getData
promise in the this
object. This is then used when we call the printObject
promise.
Run the promises.js script; its functionality should be familiar from the currency.js script you worked with in chapter 3.
Open the promises.js
script and study the code carefully. Notice that it defines 5 promises and chains them together. You are going to extend the functionality by defining some additional promises and adding them to the promise chain.
- modify the script to ask for the currency to convert to and display only the one conversion rate.
- instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency
- use the OpenExchangeRates API to display the full name of the chosen currency
In the async examples we have seen so far, each async function needs to complete before the next async call is run. The diagram below shows how this looks.
1 2 3
───⬤─────⬤─────⬤
The program flow is:
- The first async call, getData, is executed.
- Once this has completed, printObject is executed.
- Only when this has completed will the exit step execute.
There are many situations where two steps can run at the same time. This is awkward to coordinate using standard callbacks but is straightforward using promises.
The first stage is to create an array of promises. Typically this is done by looping through an array of data and using this to return an array of promises.
const dataArray = ['USD', 'EUR']
const promiseArray = []
dataArray.forEach( curr => {
promiseArray.push(new Promise( (resolve, reject) => {
const url = `http://api.fixer.io/latest?base=GBP&symbols=${curr}`
request.get(url, (err, res, body) => {
if (err) reject(new Error(`could not get conversion rate for ${curr}`))
resolve(body)
})
}))
})
In the example above we loop through the dataArray
, creating a new promise object that we push onto our promiseArray
.
Once we have an array of promises there are two possible scenarios.
- We want all the promises in the array to be fulfilled before continuing the promise chain.
- We want one of the promises to be fulfilled but we don't care which one.
In the first scenario we want all the promises to be fulfilled before continuing and for this we use the Promise.all() method.
Promise.all(promiseArray)
.then( results => results.forEach( item => console.log(item)))
.catch( err => console.log(`error: ${err.message}`))
When the promise returned by Promise.all() fulfills, it passes an array of results to the then() handler, in the same order as the original array. If any promise in the array is rejected, the catch() handler is called instead. In the example above we loop through the results and print each to the terminal.
The alternative is that once one of the promises in the array has settled we want to take its value and continue the promise chain. In this scenario we use Promise.race().
Promise.race(promiseArray)
.then( result => console.log(result))
.catch( err => console.log(`error: ${err.message}`))
As you can see, only a single value is returned by Promise.race(). In the example above you won't be able to predict which conversion rate will be returned but you will only get the one. A good application of this would be requesting the same data from several APIs and using whichever responds first. Bear in mind that if the first promise to settle is rejected, Promise.race() rejects as well.
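The difference between the two methods can be seen without making any API calls. The delay() helper in this sketch is a hypothetical stand-in that fulfills with a given value after a given number of milliseconds.

```javascript
'use strict'

// Hypothetical helper: a promise that fulfills with `value` after `ms` milliseconds.
const delay = (value, ms) => new Promise(resolve => setTimeout(() => resolve(value), ms))

// Returns a fresh array of promises each time it is called.
const makePromises = () => [delay('slow', 50), delay('fast', 10)]

// Fulfills when every promise has fulfilled; results keep the order of the input array.
const all = Promise.all(makePromises())
all.then(results => console.log(results))  // [ 'slow', 'fast' ]

// Settles as soon as the first promise in the array settles.
const first = Promise.race(makePromises())
first.then(result => console.log(result))  // fast
```

Note that the results from Promise.all() follow the order of the array, not the order in which the promises finished.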
In the previous sections we have covered the use of generators which allow the use of synchronous-style code to handle async code but the syntax is far from intuitive.
We then looked at the use of promises which allows you to wrap async code as a series of promises which can be chained together and implements exception handling. The price we pay for this is non-intuitive syntax which can become over complex. Async functions combine the benefits of promises with a clean synchronous-style syntax, avoiding the complex syntax used in promise chains. They are designed to simplify the behaviour of using promises in a synchronous manner.
Whenever we execute a function there is some implicit behaviour we expect. One behaviour is that, once invoked, a function will run until it gets to the end. Async functions break this behaviour: they can pause at an await expression and resume at a later point in the script. This enables us to write asynchronous code that looks and feels synchronous; it can even use standard try-catch exception handling.
- We can chain promises together in a cleaner way with full exception handling.
- We can substitute a promise with an async function without needing to change any other part of the script.
Here is a simple example.
const getData = url => new Promise( (resolve, reject) => {
request(url, (err, res, body) => {
if (err) reject(new Error('invalid API call'))
resolve(body)
})
})
const printObject = data => new Promise( resolve => {
console.log(JSON.stringify(JSON.parse(data), null, 2))
resolve()
})
async function main() {
try {
const data = await getData('http://api.fixer.io/latest?base=GBP')
await printObject(data)
process.exit()
} catch (err) {
console.log(`error: ${err.message}`)
process.exit()
}
}
main()
Async functions are declared using the async keyword in the function declaration; all errors are handled using the standard try-catch block. Because the main block of code needs to be in an async function, this has to be explicitly executed at the end of the script.
The getData() function returns a promise. It is called using the await keyword, which pauses the execution of the main() function until the promise returned by getData() is either fulfilled or rejected. If it is fulfilled, the data returned is stored in the data variable and control moves to the next line; if it is rejected, code execution jumps to the catch block.
The value returned by an async function is implicitly wrapped in a Promise.resolve() and any uncaught errors are wrapped in a Promise.reject(). This means that an async function can be substituted for a function that returns a promise. Let's look at a simple example.
const printObjectPromise = data => new Promise( (resolve) => {
const indent = 2
data = JSON.parse(data)
const str = JSON.stringify(data, null, indent)
console.log(str)
resolve()
})
const printObjectAsync = async data => {
const indent = 2
data = JSON.parse(data)
const str = JSON.stringify(data, null, indent)
console.log(str)
}
Both printObjectPromise and printObjectAsync behave in the same manner. They both return a promise that fulfills once the data has been printed and so can be used in either a promise chain or an async function.
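This interchangeability can be checked with a small sketch. The half() function is a hypothetical example: its return value becomes the fulfilled value of a promise and a thrown error becomes a rejection.

```javascript
'use strict'

// Hypothetical async function: the returned value becomes the fulfilled value of a promise.
const half = async n => {
  if (typeof n !== 'number') throw new Error('not a number')  // becomes a rejection
  return n / 2
}

// Used in a promise chain...
half(10).then(result => console.log(result))
half('ten').catch(err => console.error(err.message))

// ...or awaited inside another async function.
async function main() {
  try {
    const result = await half(8)
    console.log(result)
    await half('eight')  // rejects, so control jumps to the catch block
  } catch (err) {
    console.error(err.message)
  }
}

main()
```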
Run the asyncFunctions.js
script. Note that it works in the same way as the previous ones. Open the script and study it carefully.
- modify the script to ask for the currency to convert to and display only the one conversion rate.
- instead of printing the exchange rate, ask for the amount to be converted and then return the equivalent in the chosen currency
- use the OpenExchangeRates API to display the full name of the chosen currency
- rewrite the printObject promise as an async function.
- rewrite another promise as an async function.
One of the more intriguing features of JavaScript is its support for a paradigm called functional programming. In simple terms this includes:
- The contents of a variable can't change once assigned (constants only).
- The elimination of loops and control structures.
- The use of higher-order functions.
Whilst this list is far from complete it allows us to experiment with an alternative (and powerful) way to write programs. Open the functional.js
file to understand how this is achieved. In it we will be manipulating lists of data (arrays) by applying the functional concepts listed above.
The Array.map() function creates a new Array by calling the provided function on each element.
In this example, we define a function called makeUpperCase(). Its parameter will be the current array element. The function is then passed to the Array.map() function.
const names = ['Mark', 'John', 'Stephen']
function makeUpperCase(name) {
return name.toUpperCase()
}
const upper = names.map(makeUpperCase)
Whilst this works fine we normally avoid creating a named function and pass an anonymous function. The example below has identical functionality to the previous example.
const names = ['Mark', 'John', 'Stephen']
const upper2 = names.map( value => {
return value.toUpperCase()
})
By using the Arrow Function the return statement is inferred if the braces ({}) are removed. The example below is logically identical to the example above.
const names = ['Mark', 'John', 'Stephen']
const upper3 = names.map( value => value.toUpperCase() )
- Use the
Array.map()
function to create an array with all the names in lower case only.
The Array.filter()
method creates an array filled with all array elements that pass a
test (provided as a function)
const data = ['Coventry', 3.14159, 42]
function getInt(val) {
if (Number.isInteger(val)) {
return true
}
return false
}
const integers = data.filter(getInt)
The same test can be written using an anonymous function.
const integers = data.filter( val => Number.isInteger(val) )
As before we can use the feature of the arrow function syntax to reduce the above to a single line.
const integers = ['Coventry', 3.14159, 42].filter( val => Number.isInteger(val) )
Here is an example showing the use of the typeof operator to return values of type String.
function getStr(val) {
if (typeof val === 'string') {
return true
}
return false
}
const strings = ['Coventry', 3.14159, 42].filter(getStr)
As before we can rewrite this as a single line. This is logically identical to the example above.
const strings = ['Coventry', 3.14159, 42].filter( val => typeof val === 'string')
- Return an array that only contains floating point numbers (non-whole numbers). Hint: since all numbers are the same data type (Number) you will need to check both the data type and whether it is a whole number (using modulo division).
- Now turn it into a single-line function by removing the braces and return statement.
- You should now be able to return an array that only contains integers (whole numbers).
The Array.reduce()
function takes an array and reduces it to a single value using
an accumulator variable to track the result. Notice that the anonymous function has 2
parameters, the accumulator (that passes its current value) and the array value.
The value returned by the function becomes the value of the accumulator.
function getLongest(acc, val) {
if (val.length > acc.length) {
return val
} else {
return acc
}
}
const longest = ['Mark', 'John', 'Stephen'].reduce(getLongest)
Again, we can use an anonymous function to avoid having to define the named function.
const longest = ['Mark', 'John', 'Stephen'].reduce( (acc, val) => {
if (val.length > acc.length) {
return val
} else {
return acc
}
})
The Array.reduce() function takes a second parameter which allows you to specify an initial value for the accumulator (if this is omitted, the first element of the array is used as the initial value and iteration starts from the second element). This allows the function to be used in a number of surprising ways. Take a look at the following example and see if you can figure out what it does and how it works.
function reverse(acc, val) {
return val + acc
}
const rev = 'william'.split('').reduce(reverse, '')
As before, we can use an anonymous function.
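const rev = 'william'.split('').reduce( (acc, val) => {
acc.unshift(val)  // add each character to the front of the array
return acc
}, []).join('')  // join the reversed characters back into a string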
And by taking advantage of the arrow function syntax we have a single line.
const rev = 'william'.split('').reduce( (acc, val) => val + acc, '')
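The effect of the initial value can be checked directly. This sketch uses an invented array of numbers to compare reduce() with and without the second parameter.

```javascript
'use strict'

const values = [1, 2, 3]
const add = (acc, val) => acc + val

// No initial value: the accumulator starts as the first element (1).
const total = values.reduce(add)

// Initial value supplied: the accumulator starts at 10.
const totalFrom10 = values.reduce(add, 10)

// An initial value of a different type changes what is built: here a string.
const joined = values.reduce((acc, val) => acc + val, '')

console.log(total, totalFrom10, joined)  // 6 16 123
```

Calling reduce() on an empty array without an initial value throws a TypeError, which is another reason to supply one.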
- Write a single-line script to return the longest name in an array. You will need to use the conditional (ternary) operator.
One of the benefits of these array functions is that they are called on arrays and both map() and filter() return an array. This means that we can chain them together to solve more complex problems.
- Return the largest integer by chaining filter and reduce.
In the previous tasks we have been working with data that is available via a RESTful API but what do you do if the information you need is only found in human-readable format in an HTML webpage?
In this task you will learn how to extract data from HTML web pages, a technique known as Screen Scraping. This is much harder than using an existing API because:
- the html won't have semantic information
- if the website author changes the page your script will need to be rewritten.
Despite these issues sometimes this approach is the only way to get the information you need.
Open the quotes/index.js file and notice that it imports a custom quotes module; the ./ indicates that it is in the current directory. Because the parsing code can get quite complex it is best practice to place this in a custom module.
There is only one function in this module, called getQuotes(), which takes two parameters: the author name and a callback. The callback follows best practice by taking an error as its first parameter, followed by the data.
Now open the quotes/quotes.js module. The screen-scraping functionality is in a private function which is called by the exported function. This makes it easier to update if the web page layout changes.
If an error occurs the callback is called with an Error as the first parameter. If no error occurs, the callback is called with null as the first parameter and the data as the second parameter. This pattern is consistent with the built-in NodeJS functions that take a callback.
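The error-first convention can be sketched without any scraping at all. In the example below `parseAge()` is a hypothetical function invented for illustration; only the shape of the callback matters.

```javascript
'use strict'

// Hypothetical function that follows the error-first convention.
function parseAge(text, callback) {
  const age = Number.parseInt(text, 10)
  if (Number.isNaN(age)) {
    // failure: an Error is the first (and only) argument
    return callback(new Error(`'${text}' is not a number`))
  }
  // success: null in the error position, then the data
  return callback(null, age)
}

const results = []
const handler = (err, data) => {
  if (err) {
    results.push(`error: ${err.message}`)
    return
  }
  results.push(`age: ${data}`)
}

parseAge('42', handler)
parseAge('forty', handler)
console.log(results)
// prints: [ 'age: 42', "error: 'forty' is not a number" ]
```

Checking `err` first and returning early keeps the success path free of nested conditions.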
Run the index.js script and try searching for a valid person (such as Asimov), then copy the URL printed in the console into the Chrome web browser.
Open the Developer Tools and choose the Elements tab. As you hover over the DOM elements in the Elements tab you will see the content highlighted in the browser window.
Use this to expand the DOM until you can highlight the first quote in the list.
- Notice that all the quotes are in a <dl> (definition list) tag.
- Each quote is in an <a> (anchor) tag inside a <dt> (definition term) tag.
In the scraper() function:
- The supplied parameters are used to create a unique URL. It is absolutely vital that:
  - each resource has a unique URL.
  - the URL for each resource can be calculated based on the supplied parameters.
- The URL is logged to the console so that it can be pasted into a browser to check for validity.
- The request module is used to grab the web page HTML, which is available in the body parameter of the callback.
- The cheerio module loads the page DOM into a constant which can be parsed using jQuery syntax.
- We then check the DOM for particular elements.
  - If there is a <p> tag containing the text No quotations found we know the search has returned no quotations so an error is returned.
  - The number of quotes is extracted from the DOM and stored as a property of the data object.
  - An empty quotes[] array is added to the data object.
  - jQuery.each() is used to loop through each of the tags of interest.
  - Each quote is then pushed onto the quotes[] array.
- Once all the data has been extracted from the DOM and added to the data object this is passed to the callback function.
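The flow described above can be sketched without any network access or third-party modules. This is only an approximation under stated assumptions: the address in `buildUrl()` is a placeholder, `sampleHtml` is made-up markup, and a regular expression stands in for cheerio purely to keep the example dependency-free. The lab code itself uses request and cheerio.

```javascript
'use strict'

// Made-up markup standing in for a downloaded page.
const sampleHtml = `
<dl>
  <dt><a href="/q/1">First sample quote.</a></dt>
  <dt><a href="/q/2">Second sample quote.</a></dt>
</dl>`

// Placeholder address: the real site and its URL pattern are assumptions here.
function buildUrl(author) {
  return `https://example.com/quotes?author=${encodeURIComponent(author)}`
}

// Collects the text of every <a> inside a <dt> and hands it to the callback.
// A regular expression replaces cheerio only to avoid dependencies.
function extractQuotes(html, callback) {
  if (html.includes('No quotations found')) {
    return callback(new Error('no quotations found'))
  }
  const data = { quotes: [] }
  const pattern = /<dt>\s*<a[^>]*>(.*?)<\/a>\s*<\/dt>/g
  let match
  while ((match = pattern.exec(html)) !== null) {
    data.quotes.push(match[1])
  }
  // here the count is derived from the array rather than read from the page
  data.count = data.quotes.length
  return callback(null, data)
}

console.log(buildUrl('Isaac Asimov'))
extractQuotes(sampleHtml, (err, data) => {
  if (err) return console.log(err.message)
  console.log(data.count, data.quotes)
})
```

Regular expressions are fragile on real pages, which is one reason a DOM parser such as cheerio is the better tool once the page structure gets more complex.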
The best way to learn about screen scraping is to have a go. In this task you will be writing a script to search for books based on ISBN numbers and return useful data.
You will be using the Amazon website. Start by searching for a specific ISBN such as 1449336361; this will give you a URL to parse.
The next step is to remove the unnecessary parts of the URL until you are left with something you can work with. This is a process of trial and error but you need to be able to construct this URL using only the ISBN number.
https://www.amazon.co.uk/dp/1449336361
Have a go at writing a books screen scraper and try to return:
- Title
- Authors
- Description
- Price
- Rating
By now you should have decided on the theme for your API.
- Identify any existing APIs you can integrate into your assignment. Write a NodeJS script to extract and display appropriate data.
- Identify websites that contain useful data and write a NodeJS screen scraper to extract and display useful data (lists of items and details on specific items).
Outcomes
- Understand and use command line options.
- Understand and use callbacks to produce asynchronous code.
- Understand the JSON data format and know how to convert between it and JavaScript objects.
- Understand Screen Scraping
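Waiting for IO to complete is a big waste of resources.
Three solutions:
- synchronous processes
- Apache threads
- Node
NodeJS runs in a single thread.
JavaScript supports lambdas / callbacks.
The long-running work behind a callback is handled outside the main thread (by the operating system or a worker pool).
The callback itself runs on the main thread once that work has finished.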
Using Request.
Main methods correspond to HTTP verbs:
request.get(url, callback)
request.put(url, data, callback)
request.post(url, data, callback)
request.del(url, callback)
Be careful, because callbacks are asynchronous
A callback (higher-order) function
Passed around like a variable
a function that is passed to another function as a parameter
the callback function is called (or executed) inside the other function.
When we pass a callback function as an argument to another function, we are only passing the function definition.
The containing function has the callback function in its parameter as a function definition
The function is not executed in the parameter.
It can execute the callback anytime.
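The difference between passing a function and calling it can be shown in a few lines. The names `repeat` and `record` below are illustrative only.

```javascript
'use strict'

const calls = []

// The callback: nothing happens until something calls it.
function record(index) {
  calls.push(`call ${index}`)
}

// The containing function: it decides when, and how often, to run the callback.
function repeat(times, callback) {
  for (let i = 0; i < times; i++) {
    callback(i) // the callback is executed here, inside repeat()
  }
}

repeat(3, record) // no parentheses after record: the function itself is passed
console.log(calls) // prints: [ 'call 0', 'call 1', 'call 2' ]
console.log(typeof record) // prints: function
```

Writing `repeat(3, record())` instead would run `record` immediately and pass its return value (`undefined`), which is a common source of "callback is not a function" errors.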
Callbacks are important!
NodeJS runs in a single threaded event loop.
If a long-running operation occurs, the process stops ("blocks") until the operation has finished.
To prevent blocking operations any long running activities are run in callbacks.
The callback is a function that should be run after the operation is complete.
While it is processing, control is passed back to the main event loop.
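A small sketch makes the ordering visible. Here `setTimeout()` stands in for any long-running operation such as a network request; the `order` array is just a way of recording what ran when.

```javascript
'use strict'

const order = []

order.push('before')

// setTimeout stands in for a slow operation; its callback runs later,
// once the main script has finished and the timer has expired.
setTimeout(() => {
  order.push('callback')
  console.log(order) // prints: [ 'before', 'after', 'callback' ]
}, 10)

// This line runs immediately: the script does not wait for the timer.
order.push('after')
```

This is why any code that depends on the result of an asynchronous call has to live inside the callback (or be called from it) rather than on the line after the call.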
Simple GET request with callback:
'use strict'
const request = require('request')
request.get( 'http://api.fixer.io/latest?symbols=GBP', (err, res, body) => {
  if (err) {
    console.log('could not complete request')
    return // stop here: there is no body to print
  }
  console.log(body) // the response body is a JSON string
})
RESTful APIs send data across the Internet
Needs to be transmitted as text (ASCII/Unicode)
Needs to communicate both the data and its structure.
- Variables
- Objects
- Arrays
Common data exchange formats
- XML (Extensible Markup Language)
- JSON (JavaScript Object Notation)
- YAML (YAML Ain't Markup Language, originally Yet Another Markup Language)
- CSV (Comma-Separated Values)
XML Example
<address>
<org>Coventry University</org>
<street>4 Gulson Road</street>
<city>Coventry</city>
<country>United Kingdom</country>
<postcode>CV1 5FB</postcode>
</address>
JSON Example
{
  "address": {
    "org": "Coventry University",
    "street": "4 Gulson Road",
    "city": "Coventry",
    "country": "United Kingdom",
    "postcode": "CV1 5FB"
  }
}
YAML Example
address:
  org: "Coventry University"
  street: "4 Gulson Road"
  city: "Coventry"
  country: "United Kingdom"
  postcode: "CV1 5FB"
CSV Example
"org", "street", "city", "country", "postcode"
"Coventry University", "4 Gulson Road", "Coventry", "United Kingdom", "CV1 5FB"
Why do we prefer the JSON format?
- Text-based
- Position independent
- Lightweight
- Interoperable with JavaScript Objects
Converting to and from JSON
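const jsObj = {
  firstname: 'John',
  lastname: 'Doe'
}
// object to compact JSON string
const jsonStr = JSON.stringify(jsObj)
// object to JSON string indented by 2 spaces
const jsonStr2 = JSON.stringify(jsObj, null, 2)
// JSON string back to object
const newObj = JSON.parse(jsonStr)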
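JSON is stricter than JavaScript object syntax (keys must be double-quoted and trailing commas are not allowed), and JSON.parse() throws a SyntaxError when given invalid text. The sketch below wraps the call in a try/catch; `safeParse()` is a hypothetical helper, not a built-in function.

```javascript
'use strict'

// Hypothetical helper: returns the parsed value, or null if the text is not valid JSON.
function safeParse(text) {
  try {
    return JSON.parse(text)
  } catch (err) {
    return null
  }
}

const good = safeParse('{"firstname": "John", "lastname": "Doe"}')
const bad = safeParse("{firstname: 'John'}") // JavaScript object syntax, not JSON

console.log(good.lastname) // prints: Doe
console.log(bad) // prints: null
```

Data received from an API arrives as a string, so some form of error handling around JSON.parse() is worth having before the result is used.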
Sometimes called Data Scraping
Extracting data from a human-readable web page
Why use screen scraping?
Some data not available through an API
Usually a last resort
Sometimes companies scrape their own websites!
There are some challenges:
- Complex process
- Needs deconstructable URLs
- Success depends on the DOM not changing
- Most search results are paginated
Deconstructable URLs.
To access search results:
- Search term needs to be inserted into URL
To access resources:
- Product ID needs to be inserted into URL.
Here are some examples:
Amazon Book Search URL (javascript)
https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Dstripbooks&field-keywords=javascript
https://www.amazon.co.uk/s/?url=search-alias%3Dstripbooks&field-keywords=javascript
Guardian Bookstore
http://bookshop.theguardian.com/catalogsearch/result/?q=javascript&order=relevance&dir=desc
http://bookshop.theguardian.com/catalogsearch/result/?q=javascript
BBC iPlayer search for history
http://www.bbc.co.uk/iplayer/search?q=history
Accessing resources.
Amazon Books
https://www.amazon.co.uk/JavaScript-Definitive-Guide-Guides/dp/0596805527/ref=sr_1_2?s=books&ie=UTF8&qid=1476384737&sr=1-2&keywords=javascript
https://www.amazon.co.uk/dp/0596805527
http://bookshop.theguardian.com/javascript-patterns.html
But the ISBN is 0596806752
http://www.bbc.co.uk/iplayer/episode/b019c88d/the-grammar-school-a-secret-history-episode-2
http://www.bbc.co.uk/iplayer/episode/b019c88d
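Once a URL has been trimmed down like this it can be rebuilt from a single value. The sketch below reconstructs the two shortened Amazon URLs listed above; the function names `searchUrl` and `resourceUrl` are illustrative, and the URL patterns are simply those in the examples, which may change over time.

```javascript
'use strict'

// Rebuilds the shortened Amazon book search URL from a search term.
function searchUrl(term) {
  const url = new URL('https://www.amazon.co.uk/s/')
  // searchParams takes care of encoding, so '=' becomes %3D
  url.searchParams.set('url', 'search-alias=stripbooks')
  url.searchParams.set('field-keywords', term)
  return url.href
}

// Rebuilds the shortened Amazon product URL from an ISBN.
function resourceUrl(isbn) {
  return `https://www.amazon.co.uk/dp/${encodeURIComponent(isbn)}`
}

console.log(searchUrl('javascript'))
console.log(resourceUrl('0596805527'))
```

Letting the URL class encode the parameters avoids broken addresses when a search term contains spaces or punctuation.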
Screen scraping techniques
- Browse to web page using Google Chrome
- Open the Developer tools (elements tab)
- Expand DOM structure and see what content it controls
- Uniquely identify the data
- Extract data using jQuery patterns
Module for screen scraper.
Process is messy
Needs updating when page structure changes
Need to isolate in its own module
Keep public interface simple
Public interface:
- Pass a search string and get a JavaScript array in return
- Pass a resource identifier and get a JavaScript resource back
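One possible shape for such a module is sketched below. Everything in it is an assumption made for illustration: the `catalogue` array stands in for scraped data, and the names `loadItems`, `search` and `getById` are invented.

```javascript
'use strict'

// Made-up catalogue standing in for data scraped from a website.
const catalogue = [
  { id: 'b1', title: 'Sample JavaScript Book' },
  { id: 'b2', title: 'Sample Python Book' }
]

// Private helper: the only place that would need to change
// if the page structure changed.
function loadItems() {
  return catalogue
}

// Public interface: search string in, array out.
function search(term, callback) {
  const wanted = term.toLowerCase()
  const matches = loadItems().filter(item => item.title.toLowerCase().includes(wanted))
  return callback(null, matches)
}

// Public interface: resource identifier in, single object out.
function getById(id, callback) {
  const item = loadItems().find(entry => entry.id === id)
  if (!item) return callback(new Error(`no item with id ${id}`))
  return callback(null, item)
}

// In a real module only these two functions would be exported,
// for example with module.exports = { search, getById }
const books = { search, getById }

books.search('javascript', (err, list) => console.log(list.length))
books.getById('b2', (err, book) => console.log(book.title))
```

Because callers only ever see `search` and `getById`, the messy parsing code can be rewritten freely when the target page changes.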