RunKit +

Try any Node.js package right in your browser

This is a playground to test code. It runs a full Node.js environment with all of npm’s 400,000 packages pre-installed, including a-extractor. Try it out:

var aExtractor = require("a-extractor")

This service is provided by RunKit and is not affiliated with npm, Inc or the package authors.

a-extractor v2.0.1

Article content extraction database

📃 Article extractor

Database of expressions used for extracting content from blogs and articles.

The main database is in JSON5 format, a strict subset of JavaScript. It is also available as plain JSON, for convenience.
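To illustrate how the two formats relate, here is a minimal sketch. The entry below is hypothetical: the field names (`author`, `date`, `content`) and the selectors are made up for illustration and are not taken from the package's real schema. It only shows that JSON5 text is a valid JavaScript expression and carries the same data as its plain JSON counterpart.

```javascript
// Hypothetical entry; field names and selectors are illustrative assumptions.
const json5Text = `{
  // JSON5 permits comments, unquoted keys, single quotes and trailing commas
  author: '.byline a',
  date: 'time[datetime]',
  content: 'article .body',
}`;

// The same data written as plain JSON.
const jsonText =
  '{"author": ".byline a", "date": "time[datetime]", "content": "article .body"}';

// JSON5 is valid JavaScript expression syntax, so the engine can evaluate it.
// Only do this with trusted text; it executes arbitrary code.
const fromJson5 = new Function(`return (${json5Text});`)();
const fromJson = JSON.parse(jsonText);

console.log(JSON.stringify(fromJson5) === JSON.stringify(fromJson));
```

For untrusted input, a dedicated parser such as the `json5` package's `JSON5.parse` is the safer choice. Evaluating the text is used here only to keep the sketch free of dependencies.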

The extraction expressions are written for Cheerio, whose syntax is similar to jQuery's.

The targeted information is:

  • the author
  • the date when the article was written
  • and of course, the article text, as clean as possible

This project is designed to be used with Clean-Mark, but you can use it however you want.

86 domains available

  • abcnews.go.com
  • aeon.co
  • agroinfo.ro
  • arenait.net
  • arstechnica.com
  • articles.latimes.com
  • artsy.net
  • bbc.com
  • beta.theglobeandmail.com
  • bigthink.com
  • bindiribli.ro
  • bossfeed.net
  • businessinsider.com
  • collectivelyconscious.net
  • curentul.info
  • dailymail.co.uk
  • deepdotweb.com
  • digi24.ro
  • earthsky.org
  • edition.cnn.com
  • engadget.com
  • express.co.uk
  • farnamstreetblog.com
  • fastcompany.com
  • finesociety.ro
  • firstpost.com
  • foxnews.com
  • galacticconnection.com
  • gandeste.org
  • gazetadambovitei.ro
  • gnosticwarrior.com
  • hackread.com
  • hbr.org
  • hotnews.ro
  • howtogeek.com
  • huffingtonpost.com
  • info.localytics.com
  • infoalert.ro
  • irishmirror.ie
  • isgp-studies.com
  • jamesclear.com
  • jurnalul.ro
  • latimes.com
  • life.ro
  • mashable.com
  • merckmanuals.com
  • money.cnn.com
  • nautil.us
  • nbcnews.com
  • ncbi.nlm.nih.gov
  • neonnettles.com
  • news.com.au
  • newscientist.com
  • newyorker.com
  • nytimes.com
  • nzherald.co.nz
  • observator.tv
  • pri.org
  • qz.com
  • romaniaa.ro
  • rt.com
  • rts.earth
  • smh.com.au
  • start-up.ro
  • stiri.tvr.ro
  • stirileprotv.ro
  • techcrunch.com
  • techradar.com
  • telegraph.co.uk
  • theatlantic.com
  • theguardian.com
  • theliberal.ie
  • thenextweb.com
  • theverge.com
  • thrillist.com
  • torrentfreak.com
  • usatoday.com
  • usnews.com
  • vox.com
  • wakingtimes.com
  • wall-street.ro
  • washingtonpost.com
  • weforum.org
  • wsj.com
  • yahoo.com
  • ziare.com
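As a rough sketch of how a caller might check whether an article URL belongs to one of the domains above, the snippet below matches a URL's hostname against a small sample of the list. The helper name `findSupportedDomain` and the subdomain-matching rule are assumptions made for illustration. The package itself may resolve domains differently.

```javascript
// A small sample of the supported domains, copied from the list above.
const supportedDomains = [
  'bbc.com',
  'edition.cnn.com',
  'money.cnn.com',
  'nytimes.com',
  'theguardian.com'
];

// Hypothetical helper: returns the supported domain matching a URL, or null.
function findSupportedDomain(articleUrl, domains) {
  const host = new URL(articleUrl).hostname.toLowerCase();
  // Accept an exact match or any subdomain (for example "www.bbc.com").
  const matches = domains.filter(d => host === d || host.endsWith('.' + d));
  if (matches.length === 0) return null;
  // Prefer the longest, most specific match.
  return matches.sort((a, b) => b.length - a.length)[0];
}

console.log(findSupportedDomain('https://www.bbc.com/news/some-article', supportedDomains));
console.log(findSupportedDomain('https://example.org/post', supportedDomains));
```

Matching on the hostname rather than the full URL keeps the lookup independent of paths and query strings.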

Important

Clean-Mark already has algorithms to extract most of the info if the website is SEO friendly, e.g. if it respects schema.org/Article, Microformats, or the Open Graph protocol.
But it's not a perfect tool 🤖 and it needs help from us humans 🙄
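To make "SEO friendly" concrete, the sketch below reads Open Graph style `<meta>` tags from a page. It is not how Clean-Mark works internally. The helper `readMetaProperties` and the sample HTML are invented for illustration, and the regular expressions are deliberately simple, so they would miss unusual markup.

```javascript
// Hypothetical helper: collects <meta property="..." content="..."> pairs.
function readMetaProperties(html) {
  const result = {};
  const metaTags = html.match(/<meta\s+[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const property = /property=["']([^"']+)["']/i.exec(tag);
    const content = /content=["']([^"']*)["']/i.exec(tag);
    // Keep only tags that carry both attributes.
    if (property && content) result[property[1]] = content[1];
  }
  return result;
}

// Fabricated sample page head using Open Graph article properties.
const sampleHtml = `
<head>
  <meta charset="utf-8">
  <meta property="og:title" content="A sample headline">
  <meta property="article:author" content="Jane Doe">
  <meta property="article:published_time" content="2017-06-01T08:00:00Z">
</head>`;

const meta = readMetaProperties(sampleHtml);
console.log(meta['article:author'], meta['article:published_time']);
```

When a site publishes no such metadata, there is nothing for this kind of generic extraction to find, and that is the gap the per-domain expressions in this database are meant to fill.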

Contributions

We ❤️ contributions!

Want to report a bug, request a feature, or contribute? Contributions are accepted only through the A-Extractor GitHub repository.

The "fork-and-pull" Git workflow:

  1. Fork the repo on GitHub
  2. Clone the project to your own machine
  3. Work on your fork
    1. Make your changes and additions
    2. Change or add tests if needed
    3. Run tests and make sure they pass
    4. Add changes to README.md if needed
  4. Commit changes to your own branch
  5. Make sure you merge the latest from "upstream" and resolve any conflicts
  6. Push your work back up to your fork
  7. Submit a pull request so that we can review your changes

License

MIT © Cristi Constantin.
