html-stemmer v1.0.5

Extracts all Porter2-stemmed words from an HTML file, with the goal of aiding web-based NLP.

html-stemmer

Main repo: https://github.com/marcelpuyat/html-stemmer

Overview

Extracts all words from a file, filtering out HTML tags, stemming each word with Porter2, and filtering out stop words.

Install

npm install html-stemmer

Usage

var htmlStemmer = require('html-stemmer');

htmlStemmer.initialize();

htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
    console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});

Documentation

initialize(options)

Initializes the stemmer. Any option that is not specified falls back to its default.

Example:

htmlStemmer.initialize({
  includeTags: true,
  caseSensitive: true,
  delimiter: /[^A-Za-z0-9]+/gi
});

Options:

All of these are optional.

  • includeTags - true or false. When false, HTML tags are filtered out (e.g. '<body>' is deleted). false by default.
  • filters - An object that maps regular expressions to what they should be replaced by.
    // Example that filters '&apos;' into an apostrophe and '&quot;' into a quotation mark
    var filters = {};
    
    filters[/&apos;/gi] = '\'';
    filters[/&quot;/gi] = '"';
    
    htmlStemmer.initialize({
      filters: filters
    });
    
  • stopWords - true or false. When true, stop words (e.g. 'for', 'to', etc.) are excluded from the final array passed to the getStemmedWords callback. List of stop words used is available here. true by default.
  • caseSensitive - true or false. Converts all characters to lowercase when false. false by default.
  • stemmed - true or false. Stems each word using Porter2 when true. true by default.
  • delimiter - A RegExp delimiter that is used to split the data into tokens. By default, /[^A-Za-z]+/gi is used.
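
To make the interaction between includeTags, caseSensitive, stopWords and delimiter more concrete, here is a minimal sketch of the kind of tokenizing those options describe. It is not html-stemmer's actual implementation: the tokenize function, the tag-stripping regular expression and the SAMPLE_STOP_WORDS list are all hypothetical, and the Porter2 stemming step is left out entirely.

```javascript
// Illustrative sketch only: this is NOT html-stemmer's actual implementation.
// SAMPLE_STOP_WORDS is a hypothetical, tiny stand-in for the real stop word list.
var SAMPLE_STOP_WORDS = ['a', 'for', 'the', 'to'];

function tokenize(html, options) {
  var opts = options || {};
  var delimiter = opts.delimiter || /[^A-Za-z]+/gi; // documented default
  var removeStopWords = opts.stopWords !== false;   // true by default

  // includeTags: false (the default) drops anything that looks like <...>
  var text = opts.includeTags ? html : html.replace(/<[^>]*>/g, ' ');

  // caseSensitive: false (the default) lowercases every character
  if (!opts.caseSensitive) {
    text = text.toLowerCase();
  }

  // Split on the delimiter, then drop empty tokens and (optionally) stop words
  return text.split(delimiter).filter(function (token) {
    if (token.length === 0) return false;
    if (removeStopWords && SAMPLE_STOP_WORDS.indexOf(token.toLowerCase()) !== -1) return false;
    return true;
  });
}

console.log(tokenize('<body>Stemming for the Web</body>'));
console.log(tokenize('<p>Node 8</p>', {
  caseSensitive: true,
  delimiter: /[^A-Za-z0-9]+/gi
}));
```

With the default delimiter, digits count as separators, so a token such as '8' only survives when a delimiter like /[^A-Za-z0-9]+/gi is supplied.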

getStemmedWords(filePath, callbackFn)

Reads the file at filePath and passes an array of its words, processed according to the options specified in initialize, to callbackFn. Because the file is read asynchronously, the array is delivered through the callback rather than as a return value.

Example:

htmlStemmer.getStemmedWords('filename', function(stemmedWordsArray) {
  console.log(stemmedWordsArray); // Prints out all stemmed words in 'filename'
});
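
If you prefer promises to callbacks, the call can be wrapped. The sketch below assumes only the documented callback shape, a single argument holding the array of words. The helper name getStemmedWordsAsync is hypothetical and not part of html-stemmer. Since the documented callback has no error argument, the promise in this sketch has no way to reject on a read failure.

```javascript
// Hypothetical helper, not part of html-stemmer: wraps the documented
// callback-style getStemmedWords(filePath, callbackFn) in a Promise.
function getStemmedWordsAsync(stemmer, filePath) {
  return new Promise(function (resolve) {
    stemmer.getStemmedWords(filePath, function (stemmedWordsArray) {
      resolve(stemmedWordsArray);
    });
  });
}

// Usage, assuming html-stemmer is installed and initialized:
//   var htmlStemmer = require('html-stemmer');
//   htmlStemmer.initialize();
//   getStemmedWordsAsync(htmlStemmer, 'filename').then(function (words) {
//     console.log(words);
//   });
```

Passing the stemmer in as an argument keeps the helper easy to exercise with a stub object in tests.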