Web Scraping REST API with Node, Express and Puppeteer

Step 1. Create a Node project

Create a folder and open it in a terminal.
Type npm init and press Enter.
Answer the prompts, or press Enter to accept the defaults.
Your Node Project is ready.

Step 2. Install Puppeteer and Express 

Run npm install --save express in the terminal.
Run npm install --save puppeteer in the terminal.
This installs Puppeteer and also downloads a compatible browser build for it to control.

Step 3. Create Web Scraping Program

Create a file named app.js
Add the following lines to app.js
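const express = require('express');

const scraper = require('./utils/scraper');

const app = express();
const port = process.env.PORT || 3000;

// GET /reviews?url=<product page URL>
app.get('/reviews', (req, res) => {
  // Reject requests that do not carry a single url query parameter
  if (typeof req.query.url !== 'string') {
    return res.status(400).json({ message: "Missing required query parameter: url" });
  }
  scraper.extractReviews(req.query.url)
    .then(data => res.status(200).json({ message: "success", data: data }))
    .catch(err => {
      console.error(err); // log the underlying error for debugging
      res.status(500).json({ message: "Something went wrong. Could not fetch result." });
    });
});

app.listen(port, () => console.log(`Example app listening on port ${port}!`));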
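The route hands req.query.url to the scraper, so any string a client sends becomes a page that Puppeteer will try to open. Here is a minimal sketch of a stricter check, assuming you only want to accept absolute http or https URLs. parseTargetUrl is a hypothetical helper name, not part of Express or Puppeteer; it relies only on Node's built-in URL class.

```javascript
// Hypothetical helper: returns a normalized URL string, or null if the
// input is not an absolute http(s) URL.
function parseTargetUrl(raw) {
  if (typeof raw !== 'string') return null;
  try {
    const parsed = new URL(raw); // throws on relative or malformed input
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
    return parsed.href;
  } catch (err) {
    return null;
  }
}

console.log(parseTargetUrl('http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=3415697'));
console.log(parseTargetUrl('file:///etc/passwd')); // null
console.log(parseTargetUrl('not a url')); // null
```

Inside the route you could call this helper first and answer with status 400 whenever it returns null, so that only well-formed web addresses ever reach the scraper.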
Add a folder named utils
Add a file in utils named scraper.js
Add the following code in scraper.js
const puppeteer = require('puppeteer'); // import puppeteer

// Scrapes the review count and every review from a product page
const extractReviews = async (url) => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Read the total number of reviews; a missing element is treated as 0
    const reviewCount = await page.evaluate(() => {
      const el = document.querySelector('span[itemprop="reviewCount"]');
      return el ? Number(el.getAttribute('content')) : 0;
    });
    let reviewArray = [];
    if (reviewCount > 0) {
      // Reload the page asking for all reviews at once
      url = url + "&pagenumber=0&RSort=1&csid=ITD&recordsPerPage=" + reviewCount + "&body=REVIEWS#CustomerReviewsBlock";
      await page.goto(url, { waitUntil: 'load' });
      reviewArray = await page.evaluate(() => Array.from(document.querySelectorAll('.review')).map(review => ({
        reviewTitle: review.querySelector('.rightCol blockquote h6').textContent,
        reviewComment: review.querySelector('.rightCol blockquote p').textContent,
        reviewRating: +review.querySelector('.leftCol .itemReview dd .itemRating strong').textContent,
        reviewDate: review.querySelector('.leftCol .reviewer dd:nth-of-type(2)').textContent,
        reviewer: review.querySelector('.leftCol .reviewer dd:nth-of-type(1)').textContent
      })));
    }
    return { reviewCount: reviewCount, reviewArray: reviewArray, url: url };
  } finally {
    await browser.close(); // always close the browser, even if scraping fails
  }
};

module.exports.extractReviews = extractReviews;
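The scraper builds the "all reviews" address by string concatenation, and the appended part starts with &, so it assumes the incoming URL already has a query string. The following is a sketch of the same step using Node's built-in URL class, which works whether or not a query string is present. buildReviewsUrl is a hypothetical name; the parameter names and values are copied from the scraper code.

```javascript
// Hypothetical helper: adds the paging parameters used by the scraper
// to a product page URL and points it at the reviews block.
function buildReviewsUrl(productUrl, reviewCount) {
  const u = new URL(productUrl);
  u.searchParams.set('pagenumber', '0');
  u.searchParams.set('RSort', '1');
  u.searchParams.set('csid', 'ITD');
  u.searchParams.set('recordsPerPage', String(reviewCount)); // all reviews on one page
  u.searchParams.set('body', 'REVIEWS');
  u.hash = 'CustomerReviewsBlock';
  return u.href;
}

console.log(buildReviewsUrl('http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=3415697', 25));
```

One trade-off to keep in mind: setting searchParams re-serializes the whole query string, so unusual characters in the existing parameters may come out encoded differently than they went in.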
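Run node app.js in the terminal to start the server on localhost.
In a browser, open the following URL to test your scraping API: http://localhost:3000/reviews/?url=http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=3415697
Note that the product URL you pass must already contain a query string, because the scraper appends its extra parameters with &.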
Tadaaaaa!
If it did not work, let me know in the comments.
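One thing that can go wrong with the test URL: the product page address is passed unencoded inside another URL. That works here because it contains no & character, but a product URL with several parameters would be cut off at the first &. Here is a sketch of building the request with the address percent-encoded, assuming the server runs on localhost:3000 as above. buildApiRequestUrl is a hypothetical helper name.

```javascript
// Hypothetical helper: builds the request URL for the local API,
// percent-encoding the product page URL so its own ? and & survive.
function buildApiRequestUrl(apiBase, targetUrl) {
  const u = new URL('/reviews', apiBase);
  u.searchParams.set('url', targetUrl);
  return u.href;
}

const requestUrl = buildApiRequestUrl(
  'http://localhost:3000',
  'http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=3415697'
);
console.log(requestUrl);

// Decoding the parameter gives back the original address; Express does
// this decoding for you when it fills in req.query.
console.log(new URL(requestUrl).searchParams.get('url'));
```

Paste the first printed URL into the browser in place of the unencoded one; the API receives the same product address in req.query.url.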
