Sooner or later, most of us need to scrape a website and do something with the data: store it, run analytics on it, or process its content.
In this tutorial I will show you one way to extract the SEO-related content of a website. If you have a different use case, I have also included the GitHub repository,
which you can clone and modify to suit your requirements.
1. We will start by creating a basic Node.js and Express server.
const http = require('http');
const express = require('express');
const route = require('./routes/main');

let app = express();
let server = http.createServer(app);

app.use('/', route);

server.listen(3030, function () {
    console.log('Connected');
});
Here, using Node.js's built-in http module, we create a server that listens on port 3030. We also mount our router at '/', so every incoming request is passed to it.
2. Route incoming requests to our function
let express = require('express');
let router = express.Router();
let controller = require('../controller/main');

router.get('/get', controller.decodeHTML);

module.exports = router;
Here, using Express's Router, GET requests to /get are passed to controller.decodeHTML, where the main logic resides.
3. Parse and Extract data from webpages
Now that requests reach our main function, we start by fetching the raw HTML content of the website.
In the decodeHTML function, the content is fetched using the request package.
let responseObject = await requester(url);
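For readers who want to see the fetch step in isolation, here is a minimal sketch. It is not the code from the repository: it swaps the request package for the global fetch that ships with Node.js 18 and later, and the names fetchHtml and fetchImpl, as well as the JSON shape returned by the handler, are hypothetical choices made for illustration.

```javascript
// Sketch only. Assumes Node.js 18+, where fetch is available globally.
// fetchImpl is a hypothetical parameter that lets the function be
// exercised without network access.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  // new URL() throws a TypeError on missing or malformed input
  const parsed = new URL(url);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('Only http and https URLs are supported');
  }
  const response = await fetchImpl(parsed.href);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // Resolve with the raw HTML as a string
  return response.text();
}

// Express-style handler: reads ?url=... from the query string.
// The response body here is illustrative, not the repository's format.
async function decodeHTML(req, res) {
  try {
    const html = await fetchHtml(req.query.url);
    res.json({ url: req.query.url, length: html.length });
  } catch (err) {
    // Bad input or a failed fetch is reported as a client error
    res.status(400).json({ error: err.message });
  }
}
```

Validating the URL before fetching keeps obviously bad input (a missing parameter, a file: URL) from ever reaching the network layer.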
Once we have fetched the content of the website, we can process it however our requirements dictate. For our current use case, we extract the data using regular expressions. The exact expressions can be found in the attached repository.
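To give a rough idea of what regex-based extraction can look like, here is a small sketch. The patterns and the function name extractSeoFields are my own illustrative assumptions, not the expressions used in the repository, and they only cover simple, well-formed markup (for example, the meta tag is assumed to list name before content).

```javascript
// Sketch only: hypothetical patterns for three common SEO fields.
function extractSeoFields(html) {
  // Returns the named group "value" of the first match, or null
  const first = (pattern) => {
    const match = html.match(pattern);
    return match ? match.groups.value.trim() : null;
  };

  return {
    // Text between <title> and </title>
    title: first(/<title[^>]*>(?<value>[\s\S]*?)<\/title>/i),
    // content attribute of <meta name="description" ...>;
    // the backreference \k<q> requires the same closing quote
    description: first(
      /<meta\s[^>]*name=["']description["'][^>]*content=(?<q>["'])(?<value>[\s\S]*?)\k<q>/i
    ),
    // Every <h1>, with any nested tags stripped from its text
    h1: [...html.matchAll(/<h1[^>]*>([\s\S]*?)<\/h1>/gi)].map((m) =>
      m[1].replace(/<[^>]+>/g, '').trim()
    ),
  };
}

const sample = `<html><head><title> Example Shop </title>
<meta name="description" content="Buy things online, it's easy">
</head><body><h1>Welcome <b>back</b></h1><h1>Deals</h1></body></html>`;

console.log(extractSeoFields(sample));
```

Regular expressions are quick to write for a handful of fields, but they are fragile against unusual markup. If you need more than this, an HTML parser is usually the safer choice.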
GitHub repository: https://github.com/iatsi/seo-scanner
Project hosted at: https://woofh.com
Example usage:
https://woofh.com/get?url=https://www.flipkart.com
https://woofh.com/get?url=https://www.google.com
https://woofh.com/get?url=https://www.amazon.in
https://woofh.com/get?url=https://iatsi.thecodeground.in
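If the page you want to scan has its own query string, the target address should be percent-encoded before it is placed in the url parameter. Otherwise its & and = characters would be read as part of the outer request. A minimal sketch, assuming a server from step 1 running locally on port 3030 (buildScanUrl is a hypothetical helper name):

```javascript
// Sketch only: builds a /get request URL with a safely encoded target.
function buildScanUrl(serviceBase, targetUrl) {
  const endpoint = new URL('/get', serviceBase);
  // searchParams.set percent-encodes the value for us
  endpoint.searchParams.set('url', targetUrl);
  return endpoint.toString();
}

console.log(
  buildScanUrl('http://localhost:3030', 'https://www.google.com/search?q=seo&hl=en')
);
```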