Tools to Extract Text from Webpages

David Janes
Nov 14, 2020
(source)

As part of my ongoing explorations to understand NLP, I decided to polish up a project I’ve had lying about for years: iotdb-extract. It provides:

  • a command line tool for figuring out the structure of web pages
  • a library of extraction rules for some popular websites (you’re encouraged to make your own and/or contribute back here)
  • a command line tool for extracting text using rules
  • a command line tool for checking to make sure your rules continue to work (there’s a lot of entropy in the web)
  • a Node library for extracting text, plus all the other functions above

We need such a tool, of course, because a web page contains tons of cruft besides the content, and you don’t want to feed that cruft into your NLP algorithms.
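To make the cruft problem concrete, here is a minimal sketch in Python (standard library only) that drops text nested inside a few obviously non-content tags. It is not how iotdb-extract works and does not use its rules; the tag list, the class name `ContentTextParser`, the function `extract_text`, and the sample page are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as cruft in this sketch.
# This list is an assumption, not the rule set used by iotdb-extract.
CRUFT_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ContentTextParser(HTMLParser):
    """Collects text that is not nested inside any cruft tag."""

    def __init__(self):
        super().__init__()
        self.cruft_depth = 0  # greater than 0 while inside a cruft element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CRUFT_TAGS:
            self.cruft_depth += 1

    def handle_endtag(self, tag):
        if tag in CRUFT_TAGS and self.cruft_depth > 0:
            self.cruft_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.cruft_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-cruft text of an HTML string, one chunk per line."""
    parser = ContentTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page: navigation, footer, styles and scripts around one article.
SAMPLE_PAGE = """
<html><head><style>p { color: red }</style></head>
<body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Headline</h1><p>The actual story text.</p></article>
  <footer>Copyright notice</footer>
  <script>trackVisitor();</script>
</body></html>
"""

print(extract_text(SAMPLE_PAGE))
```

On this sample, only the headline and the story paragraph survive. Real pages are rarely this tidy, since cruft often sits in plain `div` elements, which is why per-site extraction rules are useful.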

The command line tools provide a number of output formats, including schema.org-compatible JSON-LD, YAML, JSON and JSON Lines.
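For readers unfamiliar with these formats, the following Python sketch shows roughly what a schema.org JSON-LD record and a JSON Lines stream look like. The records, field values, and helper names (`to_json_ld`, `to_json_lines`) are hypothetical; the exact fields iotdb-extract emits are not shown here and may differ.

```python
import json

# Hypothetical extraction results; the values are made up for illustration.
articles = [
    {
        "headline": "First story",
        "articleBody": "Body of the first story.",
        "url": "https://example.com/1",
    },
    {
        "headline": "Second story",
        "articleBody": "Body of the second story.",
        "url": "https://example.com/2",
    },
]


def to_json_ld(article):
    """Wrap one record as a schema.org Article in JSON-LD."""
    return {"@context": "https://schema.org", "@type": "Article", **article}


def to_json_lines(records):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# A single pretty-printed JSON-LD document.
json_ld_text = json.dumps(to_json_ld(articles[0]), indent=2)

# The same kind of record for every article, one per line.
json_lines_text = to_json_lines(to_json_ld(a) for a in articles)

print(json_ld_text)
print(json_lines_text)
```

JSON Lines is convenient for NLP pipelines because each line can be parsed on its own, so a large crawl can be streamed without loading the whole file.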

Here’s an example of an extracted article:

If you like the project, please give a clap and/or star it on GitHub.
