Scraping the Web With Node.js

Before web-based APIs became the prominent way of sharing data between services, we had web scraping. Web scraping is a data extraction technique in which you pull information from websites.

There are many ways to do this. You can copy and paste data from a website by hand, use specialized software, or write your own scripts to scrape data. In this tutorial, we will show you how to build a simple web scraper that gets some general movie information from IMDB. The technologies we will use are:

  • NodeJS
  • ExpressJS: A popular web framework for Node (gives us the server and routing)
  • Request: Helps us make HTTP calls
  • Cheerio: Implementation of core jQuery specifically for the server (helps us traverse the DOM and extract data)

Setup

Our setup will be pretty simple. If you’re already familiar with NodeJS, go ahead and set up your project and include Express, Request and Cheerio as your dependencies.

Our package.json file declares these three packages as the dependencies for our project.

With your package.json file ready to go, install your dependencies with:

npm install

With that set up, let’s take a look at what we’ll be creating. In this tutorial, we will make a single request to IMDB and get:

  • name of a movie
  • release year
  • IMDB community rating

Once we compile this information, we will save it to a JSON file on our computer. For this tutorial we will not have a front-end user interface and will rely on our command window to guide us.

Our Application

Our web scraper is going to be very minimalistic. The basic flow will be as follows:

  1. Launch web server
  2. Visit a URL on our server that activates the web scraper
  3. The scraper will make a request to the website we want to scrape
  4. The request will capture the HTML of the website and pass it along to our server
  5. We will traverse the DOM and extract the information we want
  6. Next, we will convert the extracted data into the format we need
  7. Finally, we will save this formatted data into a JSON file on our machine

If you’ve been following our other NodeJS tutorials, you should be pretty familiar with how an application is structured. For this tutorial, we will put all of the logic in our server.js file.
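The boilerplate for server.js might look something like the sketch below. It is only an outline under a few assumptions: the helper name registerScrapeRoute and the response text are made up for illustration, and the Express wiring is left in comments so the snippet itself has no dependencies.

```javascript
// Hypothetical helper: registers the /scrape route on an Express-style app.
function registerScrapeRoute(app) {
  app.get('/scrape', function (req, res) {
    // Steps 3 to 7 (request, traverse, format, save) will go in here.
    res.send('Check your console!');
  });
  return app;
}

// Wiring it up, assuming Express has been installed with npm:
//   const express = require('express');
//   const app = registerScrapeRoute(express());
```

Keeping the route in its own function is just a convenience for this sketch. Calling app.get directly in server.js works equally well.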

Making the Request

Now that we have the boilerplate of the application done, let’s get into the fun stuff. We are now on Step 3, and that is making the request to the external website we would like to scrape.

The request function takes two parameters: the URL and a callback. For the URL parameter, we will set the link of the IMDB movie we want to extract information from. In the callback, we will capture three parameters: error, response, and html.
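As a rough sketch, the callback could be written as a named function and then handed to request. The function name onPage and the MOVIE_ID placeholder in the URL are assumptions for illustration, so substitute the page you actually want to scrape. The request call is shown in comments because it needs the Request package to be installed.

```javascript
// Placeholder URL: replace MOVIE_ID with the ID of a real IMDB title.
const url = 'https://www.imdb.com/title/MOVIE_ID/';

// Hypothetical callback in the (error, response, html) shape described above.
function onPage(error, response, html) {
  if (error) {
    console.error('Request failed:', error.message);
    return null;
  }
  // Steps 5 to 7 (traverse, format, save) start from this HTML string.
  console.log('Status', response.statusCode, 'with', html.length, 'characters of HTML');
  return html;
}

// With the Request package installed, the call itself would be:
//   const request = require('request');
//   request(url, onPage);
```

Checking error first matters here: if the request fails, html will not contain a page to parse.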

Traversing the DOM

Movie Title

Now we are ready to start traversing the DOM and extracting information. First, let’s get the movie name. We’ll head over to IMDB, open up Developer Tools and inspect the movie title element, looking for something unique that will help us single out the title. We notice that the <h1> tag is our best bet for the movie title and that its class, header, is unique. This seems like a good starting spot.
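A minimal sketch of the title lookup is shown here. It assumes that $ is the function returned by cheerio.load(html) and that the title sits in the first child element of the header. The function name extractTitle is made up for this example, and the selector reflects the page as described above, which may not match IMDB's current markup.

```javascript
// Hypothetical helper. `$` is assumed to come from cheerio.load(html).
function extractTitle($) {
  // Start at the element with class "header", then read its first child.
  // Assumption: the first child holds the movie title.
  return $('.header').children().first().text().trim();
}

// Usage inside the request callback, assuming Cheerio is installed:
//   const cheerio = require('cheerio');
//   const $ = cheerio.load(html);
//   const title = extractTitle($);
```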

Release Year

Now we are able to get the movie title. Next, we’ll repeat the process, this time looking for a unique element in the DOM for the movie release year. We notice that the year is also contained within the <h1> tag, in the last element of the header. This gives us enough information to extract the year.
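The year lookup could be sketched along the same lines. Again, $ is assumed to be the result of cheerio.load(html) and extractReleaseYear is a made-up name. Pulling out the first four-digit run is an extra assumption on our part, in case the element's text contains parentheses or whitespace around the year.

```javascript
// Hypothetical helper. `$` is assumed to come from cheerio.load(html).
function extractReleaseYear($) {
  // The year lives in the last child element of the header.
  const text = $('.header').children().last().text();
  // Assumption: keep only the first four-digit run, e.g. "(2013)" -> "2013".
  const match = text.match(/\d{4}/);
  return match ? match[0] : text.trim();
}
```

If no four-digit run is found, the helper falls back to the trimmed text so that nothing is silently dropped.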

Community Rating

Finally, to get the community rating, we repeat the above process. This time, though, we notice a unique class name that makes the information easy to get: .star-box-giga-star. We can select on that class directly to extract the rating.

That’s all there is to it. If you want to extract more information, you can do so by repeating the steps we did above.
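Because the class is unique, a sketch of the rating lookup needs no traversal at all. As before, $ is assumed to come from cheerio.load(html) and extractRating is a hypothetical name.

```javascript
// Hypothetical helper. `$` is assumed to come from cheerio.load(html).
function extractRating($) {
  // The class name alone identifies the rating element.
  return $('.star-box-giga-star').text().trim();
}
```

The value comes back as a string. Whether to convert it to a number depends on how you plan to use the saved data.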

  1. Find a unique element or attribute on the DOM that will help you single out the data you need
  2. If no unique element exists on the particular tag, find the closest tag that does and set that as your starting point
  3. If needed, traverse the DOM to get to the data you would like to extract

Formatting and Extracting

Now that we have the data extracted, let’s format it and save it to our project folder. We have been storing our extracted data in a variable called json, so all that is left is to write the contents of this variable to a file. You may have noticed earlier that we required the fs library. If you didn’t know what this was for, this library gives us access to our computer’s file system, which is what lets us write files to disk.
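Writing the file could look roughly like this. The helper name saveResult, its done callback, and the field names in the example object are assumptions for illustration. The fs.writeFile and JSON.stringify calls are standard Node and JavaScript.

```javascript
const fs = require('fs');

// Hypothetical helper: writes the scraped data to a file as indented JSON.
function saveResult(data, file, done) {
  // The third argument to JSON.stringify indents the output by 4 spaces.
  const text = JSON.stringify(data, null, 4);
  fs.writeFile(file, text, function (err) {
    if (err) {
      console.error('Could not write ' + file + ':', err.message);
      return done(err);
    }
    console.log('File successfully written! Check your project folder for ' + file);
    done(null);
  });
}

// Example call with an empty placeholder object:
//   const json = { title: '', release: '', rating: '' };
//   saveResult(json, 'output.json', function () {});
```

The indentation argument is what makes the saved file easy to read. Without it, the JSON would be written on a single line.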

app.listen('8081');
console.log('Magic happens on port 8081');
exports = module.exports = app;

With this code in place, you are set to scrape and save the scraped data. Let’s start up our Node server, navigate to http://localhost:8081/scrape and see what happens.

  • If everything went smoothly your browser should display a message telling you to check your command prompt.
  • When you check your command prompt you should see a message saying that your file was successfully written and that you should check your project folder.
  • Once you get to your project folder you should see a new file created called  output.json.
  • Opening this file will give you a nicely formatted JSON document containing the extracted data.

Congrats! You just wrote your first web scraper!

Putting It All Together

In this tutorial, we built a simple web scraper that extracted movie information from an IMDB page. We covered using the Request and Cheerio libraries to make external requests and add jQuery functionality to our NodeJS server. We showed you how to traverse the DOM using jQuery in Node and how to write to the file system. I hope you enjoyed this article. Feel free to ask any questions below.

A Note on Web Scraping

Web scraping falls within a gray area of the law. Scraping data for personal use within limits is generally OK, but you should always get permission from the website owner before doing so. Our example here was deliberately minimal (we only made one request to IMDB) so that it does not interfere with IMDB’s operations. Please scrape responsibly.

Source: scotch.io

 
