Tools to Extract Text from Webpages

Nov 14, 2020

As part of my ongoing explorations to understand NLP, I decided to polish up a project I’ve had lying about for years: iotdb-extract. It provides:

a command line tool for figuring out the structure of web pages
a library of extraction rules for some popular websites (you’re encouraged to make your own and/or contribute back here)
a command line tool for extracting text using rules
a command line tool for checking to make sure your rules continue to work (there’s a lot of entropy in the web)
a Node library for extracting text, plus all the other functions above

We need such a tool, of course, because a web page carries tons of cruft besides the content (navigation, ads, scripts, footers), and you don’t want to feed that into your NLP algorithms.
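To give a feel for what "cruft" means in practice, here is a deliberately naive sketch in plain Node. It is not how iotdb-extract works (that uses per-site extraction rules); it swaps in a crude regex approach instead. The `stripCruft` function, the `CRUFT_TAGS` list, and the sample page are all hypothetical names made up for this illustration.

```javascript
// Naive cruft stripper: drops whole elements that rarely hold article text,
// then removes the remaining tags. A sketch only: regexes are not a real
// HTML parser, and this is NOT iotdb-extract's rule-based approach.
const CRUFT_TAGS = ["script", "style", "nav", "header", "footer", "aside"];

function stripCruft(html) {
  let out = html;
  for (const tag of CRUFT_TAGS) {
    // Match <tag ...>...</tag>, non-greedy, across newlines, case-insensitive.
    const re = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    out = out.replace(re, " ");
  }
  return out
    .replace(/<[^>]+>/g, " ") // drop remaining tags but keep their text
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim();
}

// Hypothetical sample page, mostly cruft with one short article.
const page = `
<html><head><style>p { color: red }</style></head>
<body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Title</h1><p>The actual content.</p></article>
  <footer>Copyright notice</footer>
  <script>trackVisitor()</script>
</body></html>`;

console.log(stripCruft(page)); // prints: Title The actual content.
```

A generic filter like this breaks down quickly on real sites, which is why per-site rules, and a way to check that they still work, are worth having.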

The command line tools provide a number of output formats, including schema.org-compatible JSON-LD, YAML, JSON, and JSON Lines.
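For readers who haven’t met these formats, the sketch below shows roughly how JSON-LD and JSON Lines differ. The records, their field names, and the helpers `toJsonLd` and `toJsonLines` are hypothetical and are not iotdb-extract’s actual output or API. Only the schema.org `Article` properties (`headline`, `articleBody`, `url`) and the JSON Lines convention of one JSON object per line are standard.

```javascript
// Hypothetical extracted records; the field names are assumptions for
// illustration, not the shape iotdb-extract actually emits.
const records = [
  { url: "https://example.com/a", title: "First post", text: "Hello world." },
  { url: "https://example.com/b", title: "Second post", text: "More text." },
];

// Wrap one record as a schema.org Article expressed in JSON-LD.
function toJsonLd(record) {
  return {
    "@context": "https://schema.org",
    "@type": "Article",
    url: record.url,
    headline: record.title,
    articleBody: record.text,
  };
}

// JSON Lines: one compact JSON object per line, newline-terminated.
function toJsonLines(items) {
  return items.map((item) => JSON.stringify(item)).join("\n") + "\n";
}

// Pretty-printed JSON-LD for a single article.
console.log(JSON.stringify(toJsonLd(records[0]), null, 2));

// The same data as JSON Lines, handy for streaming into NLP pipelines.
process.stdout.write(toJsonLines(records.map(toJsonLd)));
```

JSON Lines is the convenient choice when you extract many pages, since each line can be parsed on its own without loading the whole file.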

Here’s an example of an extracted article:

If you like the project, please give it a clap and/or star it on GitHub.


Written by David Janes