Thoughts on identifying positional metadata inside documents
Introduction
“Document Metadata” is information such as the title of a document, the name of the author, the date of publication and so forth. This is widely known, well understood, and standards* exist to record it.
In this paper, we’ll introduce “Positional Metadata” which records where a particular piece of information exists within a document. The intention is to explore the idea, its benefits and costs, and whether it’s worth putting a lot of effort into recording. We’ll also introduce some useful terminology for talking about the stages of preparing documents for NLP.
We’ll include audio and video as document types here, as well as, obviously, text.
*e.g. Dublin Core
↠ Questions to be Answered
- Is it worth putting a lot of effort into finding out the structure of a document, so that information may be contextualized (e.g. “Boromir” appears 857 times in Book 2 of The Fellowship of The Ring, on pages …)? Or is it the opposite — is it enough to know that “Boromir” is in the LoTR books, and if you need to know more you can just go search?
- Should or can the structure of the document be hierarchically organized? E.g. the LoTR is divided into three Volumes, each Volume has two Books (but numbered 1 through 6, so it’s almost a separate top-level organization), each Book is divided into Chapters, each Chapter is divided into Paragraphs, and each Paragraph is divided into Sentences. Note however that each Volume is also divided into Pages, but Pages can split Sentences and Paragraphs.
- Can we just mostly ignore positional cardinals (e.g. “Napoleon Bonaparte” appears at positions 37830–37848) and just use positional ordinals (e.g. words* 4560–4561)?
*Actually, “entity” as for example punctuation gets recorded as an entity. Like many things in NLP the more you look at it the fuzzier it becomes, and the exact value depends on how you define concepts like “entity”.
↠ Givens
- NLP algorithms require (or would strongly prefer) access to grammatical units of text, such as paragraphs or sentences. It does not make sense to send half a sentence for syntax analysis.
- All markup — HTML, for example — has to be removed before analysis
- Metadata — for example, the leading sections of Project Gutenberg files — should be removed (and/or otherwise processed) before analysis
- We are trying to create automated systems, so we want to minimize hand tuning for corpora (e.g. specifying that these particular documents are double spaced)
Document Types
↠ Text
Consider the transformation of a very hypothetical Project Gutenberg document into entities. The following transformations need to be performed, which we divide into two phases, the Normalization Path and the NLP Path.
The Normalization Path looks like:
- Start with bytes, the Raw Document
- Decode these bytes as UTF-8*. At this point you will (almost certainly) have fewer “things” in your string, as multiple bytes decode to a single character for anything beyond the ASCII range. This is called the Decoded Document.
- Possibly convert CRLF to LF, to normalize to *nix formatted documents
- Possibly convert LFLF to LF, if the document is double spaced**.
- Remove the preamble. Note that there’s a whole “extract document metadata” component that happens here also, but we’ll put it aside for brevity
- At this point we have a character string that we are going to do NLP processing on, called the Normalized Document. Any given position in this string almost certainly doesn’t correspond to the same position in the Decoded Document.
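The steps above can be sketched as follows. This is a minimal, hedged sketch: the preamble marker is a hypothetical stand-in for real Project Gutenberg header detection, and the LFLF collapse for double-spaced documents is omitted. It also records, for every character of the Normalized Document, where that character sits in the Decoded Document:

```python
# Sketch of the Normalization Path for a Gutenberg-style text file. The
# preamble marker is a hypothetical stand-in for real header detection.
PREAMBLE_MARKER = "*** START ***"

def normalize(raw: bytes, encoding: str = "utf-8"):
    """Return (normalized_text, offset_map), where offset_map[i] is the
    position in the Decoded Document of normalized_text[i]."""
    decoded = raw.decode(encoding)              # the Decoded Document
    start = decoded.find(PREAMBLE_MARKER)
    start = start + len(PREAMBLE_MARKER) if start != -1 else 0
    out, offset_map = [], []
    i = start
    while i < len(decoded):
        ch = decoded[i]
        if ch == "\r" and decoded[i + 1 : i + 2] == "\n":
            i += 1                              # CRLF -> LF: drop the CR
            ch = "\n"
        out.append(ch)
        offset_map.append(i)
        i += 1
    return "".join(out), offset_map             # the Normalized Document
```

Keeping the offset map as a side effect of normalization is cheap here, and it is what later lets any position in the Normalized Document be traced back to the Decoded Document.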
The NLP Path consists of the NLP functions we want to perform on the Normalized Document. Obviously this is cooked to preference — for example, Named Entity Recognition and Syntax Detection. The important thing about the NLP Path is that it doesn’t output a new Document, but rather what we’re calling Entity Metadata — information about parts of the document.
Usually this data will come back from whatever service produced it with the absolute position of the entity, relative to the beginning of the text string you fed it. In the case of AWS Comprehend, it can only process so many bytes of data at once, so you’ll need to chunk the document in a meaningful way — paragraphs being a natural choice.
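The chunk-and-rebase step might look like this. To be clear, this is a generic sketch, not the Comprehend API: `detect_entities` is a hypothetical stand-in for whatever per-chunk service you call, assumed to return `(start, end, label)` tuples relative to the chunk it was given:

```python
def iter_paragraphs(text):
    """Yield (absolute_start, paragraph) for each blank-line-separated
    paragraph of the Normalized Document."""
    start = 0
    for para in text.split("\n\n"):
        if para:
            yield start, para
        start += len(para) + 2  # account for the "\n\n" separator

def detect_all(text, detect_entities):
    """Run a (hypothetical) per-chunk `detect_entities` service over a
    whole document, rebasing its chunk-relative (start, end, label)
    results to absolute offsets in `text`. Paragraphs longer than the
    service's size limit would need further splitting; omitted here."""
    results = []
    for base, para in iter_paragraphs(text):
        for start, end, label in detect_entities(para):
            results.append((base + start, base + end, label))
    return results
```

The key invariant is that after rebasing, slicing the Normalized Document with the returned offsets yields the entity text again.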
*In fact, Gutenberg specifies the charset of the document within the document, so to really decode one of these documents you need to decode the first few thousand bytes as ASCII, look for the encoding, then decode the document for real in the proper encoding
**Even if you believe this is a waste of time, you can’t avoid that sometimes information has to be edited out of a document before NLP processing, for example perhaps the footnotes
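The two-pass decode described in the first footnote can be sketched as follows; the exact header line is an assumption about the Gutenberg format, so treat the pattern as illustrative:

```python
import re

def decode_gutenberg(raw: bytes) -> str:
    """Two-pass decode: sniff the declared charset from the first few
    thousand bytes read as ASCII, then decode the whole document with it.
    The header pattern is an assumption about the Gutenberg format."""
    head = raw[:4096].decode("ascii", errors="replace")
    m = re.search(r"Character set encoding:\s*([\w-]+)", head)
    encoding = m.group(1) if m else "utf-8"
    return raw.decode(encoding, errors="replace")
```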
↠ HTML
The HTML* Normalization Path is quite interesting because the HTML embeds information that we may want to know about the document — titles, paragraphs, and so forth. However, this information exists in the Decoded Document (the HTML), not in the Normalized Document (the text extracted from the document)! This provides us a strong argument that we want to compute locations based on the Decoded Document, not on the Normalized Document.
Also note that HTML documents “in the wild” have a lot of cruft that needs to be removed before being fed to NLP algorithms — advertisements, links to other articles, branding and so forth.
*Most of these arguments apply to Markdown as well
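A sketch of extracting text while remembering where each run lived in the Decoded Document, using only the standard-library parser (offsets can drift slightly around character references, which is acceptable for a sketch):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Extract text runs from HTML, recording each run's character
    offset in the Decoded Document (the raw HTML string)."""

    def __init__(self, html: str):
        super().__init__(convert_charrefs=True)
        # HTMLParser.getpos() reports (line, column), so precompute the
        # absolute offset at which each line starts.
        self._line_starts = [0]
        for line in html.split("\n"):
            self._line_starts.append(self._line_starts[-1] + len(line) + 1)
        self.runs = []  # list of (offset_in_html, text)
        self.feed(html)
        self.close()

    def handle_data(self, data):
        line, col = self.getpos()
        self.runs.append((self._line_starts[line - 1] + col, data))
```

A real pipeline would also drop the cruft (scripts, navigation, advertisements) before or during this pass; here every text node is kept for simplicity.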
↠ PDF
On *nix systems, PDFs can be converted to text using the command pdftotext*. It works best with single-column documents, though it makes some heroic efforts to deal with multicolumn and partitioned pages that are not always successful. We’ll ignore those, as well as documents requiring OCR, and just focus on PDFs that translate well.
The translated output is divided into pages, separated by Control-L / form feed. Beyond the page structure (at least doing it this way), you’re on your own.
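Given that constraint, a (page, word-offset) position for pdftotext output can be computed with nothing more than a split on the form feed character:

```python
def page_word_positions(pdftotext_output: str):
    """Map pdftotext output into ((page_number, word_index), word) pairs.
    Pages are separated by form feeds ("\f"); word_index is the ordinal
    of the word within its page."""
    positions = []
    for page_number, page in enumerate(pdftotext_output.split("\f"), start=1):
        for word_index, word in enumerate(page.split()):
            positions.append(((page_number, word_index), word))
    return positions
```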
*From poppler-utils.
↠ Audio and Video
Audio can be transcribed to text using AI services such as AWS Transcribe or Google’s Speech-to-Text. This text will have timestamp information to indicate where in the original file each piece of text occurs.
In this particular case, we should consider the raw audio / video file to be the Decoded Document, and the Normalized Document to be the pure text. After the NLP Path is executed, the time-based Positional information needs to be attached to Entities.
The main take-away here is that Positional information sometimes has to be recorded as time rather than position offset.
Conclusions
The initial document, after decoding for character set (if necessary), is called the Decoded Document. After the first pass of processing, to reduce to text and to remove irrelevant or out-of-band information, we get the Normalized Document.
If possible, extract the structure of the document from the Decoded Document, as it is usually explicitly marked there by the creator.
NLP processing is done on the Normalized Document. As entities are discovered, the Positional Metadata should be recorded so they can be found in the Decoded Document. This means, for text documents, a character index; for audio and video, a time offset*. For PDFs or other proprietary formats this becomes a more complex problem, but, for example, the page number combined with the word offset might work.
In all cases, one Entity Position collating greater than another should strictly mean it’s further into the document. There is no requirement that an Entity Position be an integer.
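For instance, the (page, word) positions suggested above already satisfy this: Python compares tuples lexicographically, so a position on a later page always collates after any position on an earlier page, without the position being an integer at all.

```python
# Positions need only a total order, not integer-ness.
# (page, word_index) tuples compare lexicographically.
p1 = (2, 311)   # page 2, word 311
p2 = (3, 0)     # first word of page 3
assert p1 < p2  # further into the document collates greater
```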
Code that produces the Normalized Document should be able to answer the question “for the Entity at position X, where is it in the Decoded Document?”.
Information about hierarchy can be captured entirely independently of entity detection. Documents can be organized in multiple overlapping hierarchies. Hierarchy information can be represented using the same Positional system that Entities use, and thus Entities can be organized hierarchically “as needed”.
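One level of hierarchy (chapters, say) can be represented as spans in the same positional system, and an entity assigned to its container on demand; a second, overlapping hierarchy (pages) is just another independent span list over the same positions. A sketch:

```python
import bisect

def containing(spans, position):
    """`spans` is a sorted list of (start, end, label) for one hierarchy
    level (chapters, pages, ...). Returns the label of the span
    containing `position`, or None if no span contains it."""
    starts = [s for s, _, _ in spans]
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and spans[i][0] <= position < spans[i][1]:
        return spans[i][2]
    return None
```

Because each hierarchy is its own span list, the page/chapter overlap noted earlier never has to be reconciled: the same entity position is simply looked up in both lists.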
*There’s a temptation to say “in seconds”, but there’s a lot of standardization in this industry which I’m not familiar with that should be respected.
↠ Definitions
- Entity Metadata — metadata associated with a particular entity in a document, for example that “building” is a noun, or that it is in position 1238
- Positional Metadata — metadata that indicates the position of an entity within a document
- Cardinal Position — an absolute position of an entity within a document
- Ordinal Position — the position of an entity expressed as its order among the entities of a document
- Normalization Path — the operations needed to take a raw stream of bytes to a text string that’s ready to be operated on by NLP functions
- NLP Path — the operations that are to be defined on the Normalized Document to do Entity discovery and so forth.
- Normalized Document — a string that’s suitable for NLP processing
- Raw Document — the raw bytes read from the filesystem (or wherever)
- Decoded Document — the raw bytes transformed into a string respecting character encoding. In PDFs, Audio, Video and so on this is really just the same as the Raw Document.