Setting up Stanford Named Entity Recognizer on Ubuntu
Some simple NLP stuff, as an alternative to e.g. AWS Comprehend.
“Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.”
The software is from here:
https://nlp.stanford.edu/software/CRF-NER.html
First make sure you have Java:
sudo apt-get install default-jre
Then it’s quite straightforward:
wget https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
unzip stanford-ner-4.0.0.zip
cd stanford-ner-4.0.0
And to test:
$ sh ner.sh sample.txt
The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O
Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O
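The word/TAG output is easy to read but awkward to use directly, so here is a minimal Python sketch that groups consecutive tokens with the same tag into entities. The function names `parse_tagged` and `extract_entities` are my own, not part of Stanford NER, and the sketch assumes each token is followed by exactly one slash and tag. One known limitation: two different entities of the same type that sit right next to each other would be merged into one.

```python
from itertools import groupby


def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word containing '/' survives.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def extract_entities(line):
    """Merge runs of consecutive tokens sharing a non-O tag into one entity."""
    entities = []
    for tag, group in groupby(parse_tagged(line), key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


sample = (
    "Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O "
    "the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION"
)
print(extract_entities(sample))
# [('Timothy R. Geithner', 'PERSON'), ('New York Fed', 'ORGANIZATION')]
```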
You can also just run it as Java, in this case with the “XML” option:
$ java -mx700m \
-cp "./stanford-ner.jar:./lib/*" \
edu.stanford.nlp.ie.crf.CRFClassifier \
-loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz \
-textFile FILENAME \
-outputFormat inlineXML
The fate of <ORGANIZATION>Lehman Brothers</ORGANIZATION>, the beleaguered investment bank, hung in the balance on Sunday as <ORGANIZATION>Federal Reserve</ORGANIZATION> officials and the leaders of major financial institutions continued to gather in emergency meetings trying to complete a plan to rescue the stricken bank. Several possible plans emerged from the talks, held at the <ORGANIZATION>Federal Reserve Bank of New York</ORGANIZATION> and led by <PERSON>Timothy R. Geithner</PERSON>, the president of the <ORGANIZATION>New York Fed</ORGANIZATION>, and <ORGANIZATION>Treasury</ORGANIZATION> Secretary <PERSON>Henry M. Paulson Jr</PERSON>.
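The inline XML output is plain text with tags sprinkled in, with no single root element, so I would not feed it to an XML parser as-is. A rough sketch with a regular expression is enough to pull out the entities. The name `entities_from_inline_xml` is hypothetical, and the pattern assumes tags are uppercase letters only and never nested, which matches the output shown here.

```python
import re

# Matches <TAG>text</TAG>; the backreference \1 forces the closing tag
# to have the same name as the opening tag.
ENTITY_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def entities_from_inline_xml(text):
    """Return (entity text, tag) pairs in the order they appear."""
    return [(m.group(2), m.group(1)) for m in ENTITY_RE.finditer(text)]


snippet = (
    "led by <PERSON>Timothy R. Geithner</PERSON>, the president of the "
    "<ORGANIZATION>New York Fed</ORGANIZATION>"
)
print(entities_from_inline_xml(snippet))
# [('Timothy R. Geithner', 'PERSON'), ('New York Fed', 'ORGANIZATION')]
```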
It also has a server mode, basically a socket send/response thing, which I totally would not expose on the Internet:
$ java -mx500m \
-cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-port 9191 \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz
Then, from a second terminal:
$ telnet localhost 9191
I know that Sherlock Homes lived in London, UK.
I/O know/O that/O Sherlock/PERSON Homes/PERSON lived/O in/O London/LOCATION ,/O UK/LOCATION ./O
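Instead of telnet, the same exchange can be done from a script. This is a minimal sketch of a client, assuming the server behaves as in the telnet session above: it reads one line of text, writes the tagged result, and then closes the connection. The function name `tag_text`, the UTF-8 encoding, and the 10 second timeout are my assumptions, not something documented here. If the server keeps the connection open after replying, the read loop will run until the timeout fires.

```python
import socket


def tag_text(text, host="localhost", port=9191, timeout=10.0):
    """Send one line of text to a running NER server and return its reply."""
    # A newline inside the text would end the request early, so collapse
    # all whitespace to single spaces and terminate with one newline.
    line = " ".join(text.split()) + "\n"
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("utf-8"))  # assumption: server expects UTF-8
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    # Decode once at the end so multi-byte characters split across
    # chunks are handled correctly.
    return b"".join(chunks).decode("utf-8").strip()
```

With the server from the command above running, `tag_text("I know that Sherlock Homes lived in London, UK.")` should return the same kind of word/TAG line that telnet shows. The default host is `localhost` on purpose, in keeping with not exposing the port to the outside world.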