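Notes for when named entity recognition (NER) with Stanford CoreNLP does not work from Python

Hi there.
These are notes to myself.
That said, I struggled with this quite a while ago.
I'm writing it up in case someone else is stuck on the same thing.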

To cut straight to the conclusion, adding

"ner.useSUTime":False,

to the properties should make it work.

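Using NER

Intended readers

This is for anyone who ran NER, got a strange error, and wondered why.
I assume Stanford CoreNLP is already installed and the server is already running (if that is unfamiliar, see the setup section later in this post).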

Environment

The code is in Python, but parts of it carry over to other languages, so adapt it as needed.
I use a library called pycorenlp.

python : 3.6.0
stanford-core-nlp : stanford-corenlp-full-2018-02-27
java : version 9
pycorenlp : 0.3.0

Solution

Pay attention to the "ner.useSUTime" line.
In the properties, in addition to requesting ner, set the ner option useSUTime to False (or 0).

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")
text = "Hello, my name is Hoge. I live in Tokyo."

prop = {"annotators": "tokenize, pos, ner",
        "ner.useSUTime": False,  # the fix: turn off SUTime inside the ner annotator
        "outputFormat": "json"}

# Named tokenizedText so that the output loop below can use it
tokenizedText = nlp.annotate(text, properties=prop)
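One thing that can make the failure confusing: as far as I understand, pycorenlp hands back the raw response text when the body cannot be parsed as JSON, so a server-side error may only show up later as an odd indexing error. That is my assumption about the library, not something I have confirmed for every version. The helper below (require_annotation is a name I made up) is a minimal sketch that checks the shape of the result before using it. It runs on stand-in values, so no server is needed.

```python
def require_annotation(result):
    """Return result if it looks like a parsed CoreNLP JSON document, else raise."""
    if isinstance(result, dict) and "sentences" in result:
        return result
    # Anything else (typically a str) is treated as a message from the server.
    raise RuntimeError("CoreNLP did not return JSON: {!r}".format(str(result)[:200]))


# Stand-in values so the sketch runs without a server.
ok = require_annotation({"sentences": []})
try:
    require_annotation("some server error text")
except RuntimeError as exc:
    message = str(exc)
```

In real use you would wrap the call, for example require_annotation(nlp.annotate(text, properties=prop)), and read the exception message to see what the server actually said.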

Now, looking at the output:

for i, line in enumerate(tokenizedText['sentences']):
    sentence = [word['word'] for word in line['tokens']]
    pos = [word['pos'] for word in line['tokens']]
    ner = [word['ner'] for word in line['tokens']]
    print("-----text:{}-----".format(i))
    l = ["{0}({1})[{2}]".format(w, p, n) for w, p, n in zip(sentence, pos, ner)]  # rough formatting: word(POS)[NER]
    print(l)


# output
#-----text:0-----
#['Hello(UH)[O]', ',(,)[O]', 'my(PRP$)[O]', 'name(NN)[O]', 'is(VBZ)[O]', 'Hoge(NNP)[PERSON]', '.(.)[O]']
#-----text:1-----
#['I(PRP)[O]', 'live(VBP)[O]', 'in(IN)[O]', 'Tokyo(NNP)[CITY]', '.(.)[O]']
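If you want to try the formatting without a running server, here is a small sketch of the same idea as a function. format_sentences is a name I made up, and sample is a hand-written stand-in for the server response, trimmed to the three keys the loop reads.

```python
def format_sentences(annotation):
    """Return one list of "word(POS)[NER]" strings per sentence."""
    formatted = []
    for sentence in annotation["sentences"]:
        formatted.append([
            "{0}({1})[{2}]".format(tok["word"], tok["pos"], tok["ner"])
            for tok in sentence["tokens"]
        ])
    return formatted


# Hand-written stand-in for the server response (not real server output).
sample = {
    "sentences": [
        {"tokens": [
            {"word": "I", "pos": "PRP", "ner": "O"},
            {"word": "live", "pos": "VBP", "ner": "O"},
            {"word": "in", "pos": "IN", "ner": "O"},
            {"word": "Tokyo", "pos": "NNP", "ner": "CITY"},
            {"word": ".", "pos": ".", "ner": "O"},
        ]}
    ]
}

lines = format_sentences(sample)
```

Returning the lists instead of printing them makes it easier to check the result or reuse it elsewhere.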

The formatting is a little hard to read, but
Hoge, which is a person's name, correctly gets the NER tag PERSON, and
Tokyo, which is a place, gets CITY.

As for the other tags, the documentation says:

For English, by default, this annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities (12 classes). Adding the regexner annotator and using the supplied RegexNER pattern files adds support for the fine-grained and additional entity classes EMAIL, URL, CITY, STATE_OR_PROVINCE, COUNTRY, NATIONALITY, RELIGION, (job) TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH (11 classes) for a total of 23 classes. Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, including CoNLL, ACE, MUC, and ERE corpora. Numerical entities are recognized using a rule-based system.

In short, it is enough to know that many other tags exist.
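Since every token carries a tag and "O" just means "not an entity", one common follow-up is to pull out only the tagged parts. The sketch below merges adjacent tokens that share the same non-"O" tag into one entity. extract_entities and the token list are made up for illustration, and the merging rule is my own simplification: two different entities of the same type sitting next to each other would be glued together.

```python
from itertools import groupby


def extract_entities(tokens):
    """Merge runs of adjacent tokens sharing a non-"O" tag into (text, tag) pairs."""
    entities = []
    for tag, run in groupby(tokens, key=lambda tok: tok["ner"]):
        if tag != "O":
            entities.append((" ".join(tok["word"] for tok in run), tag))
    return entities


# Hand-written tokens, trimmed to the two keys used here.
tokens = [
    {"word": "Taro", "ner": "PERSON"},
    {"word": "Hoge", "ner": "PERSON"},
    {"word": "lives", "ner": "O"},
    {"word": "in", "ner": "O"},
    {"word": "Tokyo", "ner": "CITY"},
]
entities = extract_entities(tokens)
```

With a real response you would call it once per sentence, passing that sentence's "tokens" list.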
That's it for the notes.

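Setting up Stanford CoreNLP

Next, for anyone who wants to start using Stanford CoreNLP, here is how to set it up.

What is Stanford CoreNLP?

It is a natural language processing tool that does parsing, named entity recognition, and other impressive things for you.

Using it from Python

To use it from Python you need to start the server, so let's do that first.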

Go to the Stanford CoreNLP page.
Download it.
Unzip the zip file and save it wherever you like (here I use aaa/hoge).
Change directory
cd aaa/hoge/stanford-corenlp-full-2018-02-27 # change this if your directory name is different
Start the server with java
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer # install Java first if you don't have it

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP -     Threads: 4
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000

Once you see output like this, you're ready to go.
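If you would rather check from Python than read the log, a plain TCP connection attempt is enough to tell whether something is listening on the port. This is only a sketch: is_port_open is a name I made up, the defaults assume the server is on localhost:9000 as in the log above, and an open port does not prove the thing listening is CoreNLP.

```python
import socket


def is_port_open(host="localhost", port=9000, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
```

Calling is_port_open() before nlp.annotate(...) gives a clearer failure than an exception from deep inside the request.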

Next, the Python side.

Install pycorenlp
pip3 install pycorenlp

Write the code

from pycorenlp import StanfordCoreNLP

# Connect to the server started above
nlp = StanfordCoreNLP("http://localhost:9000")
prop = {"annotators": "tokenize, pos", "outputFormat": "json"}

text = "Hello, my name is Hoge. I live in Japan."

tokenizedText = nlp.annotate(text, properties=prop)
print(tokenizedText)

Run the code
Running the above produces output like the following. (I've pretty-printed it, so it is not exactly what you will see.)

#output
{
	"sentences": [
		{
			"index": 0,
			"tokens": [
				{
					"index": 1,
					"word": "Hello",
					"originalText": "Hello",
					"characterOffsetBegin": 0,
					"characterOffsetEnd": 5,
					"pos": "UH",
					"before": "",
					"after": ""
				},
				{
					"index": 2,
					"word": ",",
					"originalText": ",",
					"characterOffsetBegin": 5,
					"characterOffsetEnd": 6,
					"pos": ",",
					"before": "",
					"after": " "
				},
				{
					"index": 3,
					"word": "my",
					"originalText": "my",
					"characterOffsetBegin": 7,
					"characterOffsetEnd": 9,
					"pos": "PRP$",
					"before": " ",
					"after": " "
				},
				{
					"index": 4,
					"word": "name",
					"originalText": "name",
					"characterOffsetBegin": 10,
					"characterOffsetEnd": 14,
					"pos": "NN",
					"before": " ",
					"after": " "
				},
				{
					"index": 5,
					"word": "is",
					"originalText": "is",
					"characterOffsetBegin": 15,
					"characterOffsetEnd": 17,
					"pos": "VBZ",
					"before": " ",
					"after": " "
				},
				{
					"index": 6,
					"word": "Hoge",
					"originalText": "Hoge",
					"characterOffsetBegin": 18,
					"characterOffsetEnd": 22,
					"pos": "NNP",
					"before": " ",
					"after": ""
				},
				{
					"index": 7,
					"word": ".",
					"originalText": ".",
					"characterOffsetBegin": 22,
					"characterOffsetEnd": 23,
					"pos": ".",
					"before": "",
					"after": " "
				}
			]
		},
		{
			"index": 1,
			"tokens": [
				{
					"index": 1,
					"word": "I",
					"originalText": "I",
					"characterOffsetBegin": 24,
					"characterOffsetEnd": 25,
					"pos": "PRP",
					"before": " ",
					"after": " "
				},
				{
					"index": 2,
					"word": "live",
					"originalText": "live",
					"characterOffsetBegin": 26,
					"characterOffsetEnd": 30,
					"pos": "VBP",
					"before": " ",
					"after": " "
				},
				{
					"index": 3,
					"word": "in",
					"originalText": "in",
					"characterOffsetBegin": 31,
					"characterOffsetEnd": 33,
					"pos": "IN",
					"before": " ",
					"after": " "
				},
				{
					"index": 4,
					"word": "Japan",
					"originalText": "Japan",
					"characterOffsetBegin": 34,
					"characterOffsetEnd": 39,
					"pos": "NNP",
					"before": " ",
					"after": ""
				},
				{
					"index": 5,
					"word": ".",
					"originalText": ".",
					"characterOffsetBegin": 39,
					"characterOffsetEnd": 40,
					"pos": ".",
					"before": "",
					"after": ""
				}
			]
		}
	]
}
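The characterOffsetBegin and characterOffsetEnd values in this output index into the input string, so you can slice the original text with them. Here is a small sketch (token_spans is a made-up name, and sample keeps only three tokens copied from the output above). I am assuming plain ASCII input: the offsets come from Java, so for text with characters outside the Basic Multilingual Plane they may not line up with Python string indices.

```python
def token_spans(text, annotation):
    """Yield (token word, slice of text at the token's character offsets)."""
    for sentence in annotation["sentences"]:
        for tok in sentence["tokens"]:
            begin = tok["characterOffsetBegin"]
            end = tok["characterOffsetEnd"]
            yield tok["word"], text[begin:end]


text = "Hello, my name is Hoge. I live in Japan."
# Three tokens copied from the output above, trimmed to the keys used here.
sample = {"sentences": [{"tokens": [
    {"word": "Hello", "characterOffsetBegin": 0, "characterOffsetEnd": 5},
    {"word": "Hoge", "characterOffsetBegin": 18, "characterOffsetEnd": 22},
    {"word": "Japan", "characterOffsetBegin": 34, "characterOffsetEnd": 39},
]}]}
spans = list(token_spans(text, sample))
```

This is handy when you want to highlight or replace entities in the original string rather than work with the token list.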

That's all.