Three Streams Model for NLP
Introduction
In our previous paper, we described a document processing model for Natural Language Processing (NLP) and how to identify the position of Entities discovered within those documents. In this paper, we’ll discuss a “three streams” model of NLP processing, in which each stream carries a distinct type of useful information. We’ll also discuss some possibilities for applying existing standards to record this information.
Three Streams Model
The three streams are:
- Document Metadata
- Entity Metadata
- Concepts
We call them streams because they are emitted as the document is processed: you don’t get all of the Document Metadata at once, you get each item as it is detected. Each detected item is given a confidence score, since the same piece of information can often be detected in several different ways. The title of a document, for example, could be derived from the filename, the <title> element, the <h1> element, and so forth.
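As a rough sketch of what emitted items might look like (the field names and confidence values are our own illustrative assumptions, not a standard), two candidate titles for the same document could be emitted like this:
```python
# Sketch only: field names and values are illustrative assumptions, not a standard.
# Two items for the same logical field ("title") are emitted as each candidate
# is detected, each with its own confidence score.
title_from_filename = {
    "stream": "document-metadata",
    "field": "title",
    "value": "a-study-in-scarlet",
    "source": "filename",
    "confidence": 0.40,
}

title_from_title_element = {
    "stream": "document-metadata",
    "field": "title",
    "value": "A Study in Scarlet",
    "source": "<title> element",
    "confidence": 0.85,
}
```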
Document Metadata Stream
The Document Metadata Stream carries what we would usually consider metadata: the document title, the author’s name, the date of publication, ISBN, and so forth. Much of this is covered by the Dublin Core standard. We would also add the hashes of the Raw, Decoded and Normalized Documents, the character set, and other technical details.
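A sketch of some Document Metadata items, assuming the descriptive fields borrow Dublin Core term names and the hashes and character set are added as our own technical fields (all values are illustrative):
```python
import hashlib

# Sketch only: the descriptive fields borrow Dublin Core term names
# (dc:title, dc:creator, dc:date); the hash and character-set fields are
# our own technical additions, and all values are illustrative.
raw_document = b"..."  # the Raw Document: bytes as read from storage

document_metadata_items = [
    {"field": "dc:title",   "value": "A Study in Scarlet", "confidence": 0.90},
    {"field": "dc:creator", "value": "Arthur Conan Doyle", "confidence": 0.85},
    {"field": "dc:date",    "value": "1887",               "confidence": 0.70},
    {"field": "raw-document-sha256",
     "value": hashlib.sha256(raw_document).hexdigest(),
     "confidence": 1.00},
    {"field": "character-set", "value": "utf-8",           "confidence": 0.95},
]
```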
Entity Metadata Stream
The Entity Metadata Stream carries data attached to “entities” within the document. Entities always have beginning and end positions, which allow them to be located. Entities include things like Part of Speech (e.g. is it a Noun or a Verb), Named Entities (e.g. a person’s name, a place name, a time or date), a Sentence, a Paragraph, a Chapter, the Sentiment, and so forth.
The “Thoughts on identifying positional metadata inside documents” paper describes how to create positional metadata.
There is likely work to be done standardizing this type of metadata across different platforms; AWS and Stanford, for example, use different tag sets to represent part-of-speech tokens. That standardization is out of scope here, but keep it in mind if allowing multiple AI backends.
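As a sketch, with a record layout and tag values that are purely illustrative assumptions, two Entity Metadata items might look like this:
```python
# Sketch only: record layout and tag values are illustrative assumptions.
# Every entity carries beginning and end positions so it can be located,
# per the positional-metadata paper referenced above.
entity_items = [
    {   # a part-of-speech entity
        "stream": "entity-metadata",
        "type": "part-of-speech",
        "begin": 1238, "end": 1246,   # positions within the Normalized Document
        "text": "building",
        "value": "NOUN",
        "confidence": 0.97,
    },
    {   # a named entity
        "stream": "entity-metadata",
        "type": "named-entity",
        "begin": 1523, "end": 1537,
        "text": "John C. Watson",
        "value": "PERSON",
        "confidence": 0.91,
    },
]

# Different backends label parts of speech differently; if multiple AI
# backends are allowed, a small mapping layer (illustrative entries shown)
# could normalize their tags to a common set.
PENN_TO_UNIVERSAL = {"NN": "NOUN", "VB": "VERB", "JJ": "ADJ"}
```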
Concept Stream
The Concept Stream identifies concepts found in the document that can be represented as structured data. A concept is emitted only once per document, but it may record multiple locations where it is found.
One example of a concept would be the fictional person “John Watson”. This concept need only be emitted once per document, saying essentially “this character appears in this document”. Contrast this with, for example, the Named Entities “Watson”, “John C. Watson”, and “John Watson”, which would all map back to this single concept.
The Watson in the phrase “Watson, come here I want to see you” would map to a different concept of Watson.
Concepts can potentially be much more complex than Person, Place, Organization, and the like; for example, a relation such as “X gave Y to Z”. However, that type of analysis is much more difficult to do.
Schema.org provides an excellent starting point for semantically describing concepts. We believe, though less strongly, that Wikipedia provides a good source of unique concept identifiers (URLs).
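Putting these pieces together, a concept item might combine a schema.org type, a Wikipedia URL as its identifier, and the locations of the mentions that map back to it. The field names below are our own, and the specific URL is only an illustrative choice:
```python
# Sketch only: field names are our own; the schema.org type and Wikipedia URL
# are illustrative choices for typing and identifying the concept.
john_watson_concept = {
    "stream": "concept",
    "type": "https://schema.org/Person",
    "identifier": "https://en.wikipedia.org/wiki/Dr._Watson",
    "name": "John Watson",
    "mentions": [   # emitted once per document, listing every location found
        {"begin": 1523, "end": 1537, "text": "John C. Watson"},
        {"begin": 2044, "end": 2050, "text": "Watson"},
        {"begin": 3102, "end": 3113, "text": "John Watson"},
    ],
}
```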
Appendices
Definitions
- Document Metadata Stream — a stream of metadata information about a document
- Entity Metadata Stream — a stream of metadata about entities within a document, every entity having a definitive beginning and end
- Concept Stream — a stream of concepts that exist in the document; a single concept may appear (and be referenced) multiple times per document
- Entity Metadata — metadata associated with a particular entity in a document, for example that “building” is a noun, or that it is in position 1238
- Positional Metadata — metadata that indicates the position of an entity within a document
- Cardinal Position — an absolute position of an entity within a document
- Ordinal Position — the order position of an entity within a document
- Normalization Path — the operations needed to take a raw stream of bytes to a text string that’s ready to be operated on by NLP functions (a short sketch follows this list)
- NLP Path — the operations performed on the Normalized Document to do Entity discovery and so forth
- Normalized Document — a string that’s suitable for NLP processing
- Raw Document — the raw bytes read from the filesystem (or wherever)
- Decoded Document — the raw bytes transformed into a string, respecting the character encoding. For PDFs, Audio, Video, and so on, this is really just the same as the Raw Document.
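A minimal sketch of how the last few definitions fit together; the file name is a placeholder and the specific normalization steps are assumptions, since a real Normalization Path would be defined per document type:
```python
import hashlib
import unicodedata

# Raw Document: the bytes as read from the filesystem ("example.txt" is a
# placeholder name).
raw_document = open("example.txt", "rb").read()

# Decoded Document: the bytes transformed into a string, respecting the
# character encoding (assumed here to be UTF-8).
decoded_document = raw_document.decode("utf-8")

# Normalized Document: a string ready for NLP functions.  The steps shown
# (Unicode NFC, newline normalization) are illustrative, not prescribed.
normalized_document = unicodedata.normalize("NFC", decoded_document).replace("\r\n", "\n")

# Hashes of all three forms, as suggested for the Document Metadata Stream.
hashes = {
    "raw-sha256": hashlib.sha256(raw_document).hexdigest(),
    "decoded-sha256": hashlib.sha256(decoded_document.encode("utf-8")).hexdigest(),
    "normalized-sha256": hashlib.sha256(normalized_document.encode("utf-8")).hexdigest(),
}
```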