Tools to Extract Text from Webpages

Nov 14, 2020

As part of my ongoing explorations to understand NLP, I decided to polish up a project I’ve had lying about for years: iotdb-extract. It provides:

- a command line tool for figuring out the structure of web pages
- a library of extraction rules for some popular websites (you’re encouraged to make your own and/or contribute back here)
- a command line tool for extracting text using rules
- a command line tool for checking that your rules continue to work (there’s a lot of entropy in the web)
- a Node library for extracting text, plus all the other functions above

We need such a tool, of course, because a web page contains tons of cruft besides the content, and you don’t want to feed that cruft into your NLP algorithms.
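To make the cruft problem concrete, here is a deliberately naive sketch in plain Node. It is not how iotdb-extract works: the function name `stripCruft` and the sample page are made up for illustration, and regular expressions are not a robust way to parse HTML (entities are not decoded, and nested or malformed markup will trip it up). It only shows how much of a page is not content.

```javascript
// Hypothetical sample page: navigation, styling, tracking and a footer
// wrapped around a single article.
const page = `
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <nav><a href="/">Home</a> <a href="/about">About</a></nav>
    <article><h1>Title</h1><p>Real content here.</p></article>
    <footer>Copyright notice</footer>
    <script>track();</script>
  </body>
</html>`;

// Naive cruft removal (illustrative only, an assumption of this sketch):
// 1. drop whole elements that rarely hold article text,
// 2. drop the remaining tags,
// 3. collapse whitespace.
function stripCruft(html) {
  return html
    .replace(/<(script|style|nav|header|footer|aside)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

console.log(stripCruft(page));
```

A blanket filter like this breaks down quickly on real sites, which is why per-site extraction rules are worth maintaining.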

The command line tools provide a number of output formats, including schema.org-compatible JSON-LD, YAML, JSON, and JSON Lines.
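For readers who haven’t met these formats, here is a rough sketch of how two of them differ. The records, the input field names (`title`, `author`, `published`, `text`) and the helper functions are my own assumptions for illustration, not iotdb-extract’s actual output or API. The JSON-LD side uses the standard schema.org `Article` properties.

```javascript
// Hypothetical extracted records (field names are illustrative).
const articles = [
  { title: "Example headline", author: "Jane Doe", published: "2020-11-14", text: "Body of the first article." },
  { title: "Another headline", author: "John Roe", published: "2020-11-15", text: "Body of the second article." },
];

// JSON-LD: a JSON document annotated with a vocabulary, here schema.org's Article.
function toJsonLd(article) {
  return {
    "@context": "https://schema.org",
    "@type": "Article",
    headline: article.title,
    author: { "@type": "Person", name: article.author },
    datePublished: article.published,
    articleBody: article.text,
  };
}

// JSON Lines: one compact JSON object per line, handy for streaming
// many documents into a downstream NLP pipeline.
function toJsonLines(records) {
  return records.map((record) => JSON.stringify(record)).join("\n") + "\n";
}

console.log(JSON.stringify(toJsonLd(articles[0]), null, 2));
process.stdout.write(toJsonLines(articles.map(toJsonLd)));
```

JSON-LD suits a single, self-describing document. JSON Lines suits bulk processing, since each line can be parsed on its own.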

Here’s an example of an extracted article:

If you like the project, please give it a clap and/or star it on GitHub.

Written by David Janes
