Tools to Extract Text from Webpages

As part of my ongoing explorations to understand NLP, I decided to polish up a project I’ve had lying around for years: iotdb-extract. It provides:
- a command line tool for figuring out the structure of web pages
- a library of extraction rules for some popular websites (you’re encouraged to make your own and/or contribute back here)
- a command line tool for extracting text using rules
- a command line tool for checking to make sure your rules continue to work (there’s a lot of entropy in the web)
- a Node library for extracting text, plus all the other functions above
We need such a tool, of course, because a web page carries tons of cruft besides the content, and you don’t want to feed that cruft into your NLP algorithms.
The command line tools provide a number of output formats, including schema.org-compatible JSON-LD, YAML, JSON and JSON Lines.
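To make two of those formats concrete, here is a minimal sketch in plain Node of how an extracted article might be shaped as schema.org JSON-LD and as JSON Lines. It does not use iotdb-extract and is not the tool’s actual output: the `article` object, its values, and the helper names `toJsonLd` and `toJsonLines` are all made up for illustration. Only the `@context`/`@type` convention and the `headline`, `datePublished` and `articleBody` property names come from schema.org.

```javascript
// Hypothetical extracted fields; the values are invented for this sketch.
const article = {
  headline: "Example headline",
  datePublished: "2020-01-01",
  articleBody: "First paragraph of the story.\nSecond paragraph.",
};

// Wrap plain fields as a schema.org Article expressed in JSON-LD.
// (Assumed helper, not part of any library.)
function toJsonLd(fields) {
  return {
    "@context": "https://schema.org",
    "@type": "Article",
    ...fields,
  };
}

// JSON Lines: one compact JSON document per line, each ending in a newline.
// JSON.stringify escapes newlines inside strings, so a record never
// spills onto a second line.
function toJsonLines(records) {
  return records.map((record) => JSON.stringify(record)).join("\n") + "\n";
}

const doc = toJsonLd(article);

// Pretty-printed JSON-LD, convenient for reading a single article.
const pretty = JSON.stringify(doc, null, 2);

// JSON Lines, convenient for streaming many articles into a pipeline.
const jsonl = toJsonLines([doc, toJsonLd({ headline: "Another headline" })]);

console.log(pretty);
process.stdout.write(jsonl);
```

The one-record-per-line shape is what makes JSON Lines handy for NLP work: a downstream script can read a large corpus line by line without parsing the whole file at once.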
Here’s an example of an extracted article:
If you like the project, please give it a clap and/or star it on GitHub.