
dcrawler v0.0.8

DCrawler is a distributed web spider written in Node.js and queued with MongoDB. It gives you the full power of jQuery to parse big pages as they are downloaded, asynchronously, simplifying distributed crawling.

node-distributed-crawler

Features

  • Distributed crawling
  • Configurable URL parser and data parser
  • jQuery selectors via cheerio
  • Insertion of parsed data into a MongoDB collection
  • Per-domain interval configuration in a distributed environment
  • Node 0.8+ support

Note: update to the latest version (0.0.4+); don't use 0.0.1.

I am actively updating this library; feature suggestions and fork requests are welcome :)

Installation

$ npm install dcrawler

Usage

var DCrawler = require("dcrawler");

var options = {
    mongodbUri:     "mongodb://0.0.0.0:27017/crawler-data",
    profilePath:    __dirname + "/" + "profile"
};
var logs = {
    dbUri:      "mongodb://0.0.0.0:27017/crawler-log",
    storeHost:  true
};
var dc = new DCrawler(options, logs);
dc.start();

Note: the MongoDB connection URIs (mongodbUri and dbUri) should be the same, so that URL queueing stays centralized.

The DCrawler constructor takes options and logs:

  1. options with the following properties __*__:
  • mongodbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler') *
  • profilePath: Location of the profile directory which contains the config files (e.g. /home/crawler/profile) *
  2. logs to store logs in a centralized location using winston-mongodb, with the following properties:
  • dbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler')
  • storeHost: Boolean, whether to store each worker's host name in the log collection.

Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass it to the DCrawler constructor:

var dc = new DCrawler(options);

Create a config file for each domain inside the profilePath directory. Check the example profile example.com, which contains a config with the following properties:

  • collection: Name of the MongoDB collection to store parsed data in (e.g. 'products') *
  • url: URL to start crawling from. String or Array of URLs (e.g. 'http://example.com' or ['http://example.com']) *
  • interval: Interval between requests in milliseconds. Default is 1000 (e.g. for a 2-second interval: 2000)
  • followUrl: Boolean, whether to extract further URLs from each crawled page and crawl them as well.
  • resume: Boolean, whether to resume crawling from previously crawled data.
  • beforeStart: Function executed before crawling starts. It receives a config param containing the particular profile's config object. Example function:
beforeStart: function (config) {
    console.log("started crawling example.com");
}
  • parseUrl: Function to extract further URLs from a crawled page. It receives error, the response object, and a $ jQuery object, and returns an Array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    
    try {
        $("a").each(function(){
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    
    return _url;
}
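The filtering and prefixing logic inside parseUrl can be exercised on its own, without cheerio or a live page. A minimal sketch (the normalizeProductHref helper is hypothetical, not part of dcrawler):

```javascript
// Hypothetical helper mirroring parseUrl above: keep only links that
// contain "/products", and prefix relative links with the site root so
// the crawler always queues absolute URLs.
function normalizeProductHref(href) {
    if (!href || href.indexOf("/products") === -1) {
        return null; // not a product link
    }
    if (href.indexOf("http://example.com") === -1) {
        href = "http://example.com/" + href;
    }
    return href;
}

console.log(normalizeProductHref("/products/42"));
// → "http://example.com//products/42" (note the double slash the plain
//   string concatenation produces for hrefs that already start with "/")
console.log(normalizeProductHref("http://example.com/products/42"));
console.log(normalizeProductHref("/about")); // → null
```

In a real profile you may want to strip that duplicate slash, or use Node's url module to resolve relative links.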
  • parseData: Function to extract information from a crawled page. It receives error, the response object, and a $ jQuery object, and returns the data Object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    
    try {
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;
        
        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        };
    } catch (e) {
        console.log(e);
    }
    
    return _data;
}
  • onComplete: Function executed when crawling completes. It receives a config param containing the particular profile's config object. Example function:
onComplete: function (config) {
    console.log("completed crawling example.com");
}
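Putting the properties above together, a complete profile might look like the following. This is an illustrative sketch only: the file name (e.g. profile/example.com.js), the selectors, and the URLs are placeholder assumptions, not dcrawler requirements.

```javascript
// Illustrative profile module (e.g. saved as profile/example.com.js).
// Selectors and URLs are placeholders for a hypothetical site.
module.exports = {
    collection: "products",          // target MongoDB collection
    url: "http://example.com",       // crawl starting point
    interval: 2000,                  // 2 seconds between requests
    followUrl: true,                 // also crawl URLs found by parseUrl
    resume: false,                   // start fresh each run
    beforeStart: function (config) {
        console.log("started crawling example.com");
    },
    parseUrl: function (error, response, $) {
        var urls = [];
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                urls.push(href);
            }
        });
        return urls;
    },
    parseData: function (error, response, $) {
        return {
            _id: $("h1#productId").html(),
            name: $("span#productName").html(),
            price: $("label#productPrice").html(),
            url: response.uri
        };
    },
    onComplete: function (config) {
        console.log("completed crawling example.com");
    }
};
```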

Chirag (blikenoother -[at]- gmail [dot] com)
