Summarizing n-gram language models (Interpolated Kneser–Ney smoothing)

eieito.hatenablog.com

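In the previous post I ran Interpolated Kneser–Ney smoothing with NLTK. This time I implemented it myself in Python. (The name is long, so I abbreviate it as Interpolated KN from here on.)

The details are in a notebook I uploaded to gist. (Writing everything up twice would be double work, so I decided to keep the write-up in the notebook.)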

gist.github.com

Remaining odds and ends

The definitions follow the "dice book" and SLP.

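The "dice book" (統計的自然言語処理の基礎, the Japanese edition of Foundations of Statistical Natural Language Processing)
- 統計的自然言語処理の基礎 / Christopher D. Manning and Hinrich Schütze, translated by 加藤恒昭, 菊井玄一郎, 林良彦, 森辰則 | 共立出版
SLP (Speech and Language Processing)
- Section 3.6 of https://web.stanford.edu/~jurafsky/slp3/3.pdf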

NLTK implementation

NLTK's n-gram language model implementation follows the definitions of Chen & Goodman 1995 (as far as I can tell from the comments in the source code).

The smoothing models, including Interpolated KN, are all based on the following equation from the Algorithm Summary in Section 2.8 (p. 18):

$\displaystyle{ p_{smooth} (w_i| w^{i-1}_{i-n+1}) = \alpha (w_i| w^{i-1}_{i-n+1}) + \gamma (w^{i-1}_{i-n+1}) p_{smooth}(w_{i}| w^{i-1}_{i-n+2}) }$

The implementation is unified around this equation: each smoothing class (the estimator) only has to compute $\alpha$ and $\gamma$.
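The recursion can be sketched in plain Python as follows. This is a minimal illustration under my own assumptions, not NLTK's code: the names `smoothed_score`, `alpha_gamma`, `unigram_score`, and `DISCOUNT`, the toy corpus, and the simple absolute-discounting estimator are all hypothetical and exist only to show how $\alpha$ and $\gamma$ plug into the equation.

```python
from collections import Counter, defaultdict


def smoothed_score(word, context, alpha_gamma, unigram_score):
    """p_smooth(word | context) = alpha + gamma * p_smooth(word | shorter context)."""
    if not context:
        # Base case: no context left, fall back to the unigram estimate.
        return unigram_score(word)
    alpha, gamma = alpha_gamma(word, context)
    # Drop the oldest context word and recurse on the lower-order model.
    return alpha + gamma * smoothed_score(word, context[1:], alpha_gamma, unigram_score)


# Hypothetical toy corpus and counts (assumption, for illustration only).
tokens = "a b a c a b".split()
unigrams = Counter(tokens)
bigrams = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    bigrams[(prev,)][cur] += 1

DISCOUNT = 0.5  # hypothetical discount value


def unigram_score(word):
    # Plain maximum-likelihood unigram estimate.
    return unigrams[word] / sum(unigrams.values())


def alpha_gamma(word, context):
    # Simple absolute discounting, used here as a stand-in estimator.
    counts = bigrams[context]
    total = sum(counts.values())
    if total == 0:
        # Unseen context: put all of the mass on the lower-order model.
        return 0.0, 1.0
    alpha = max(counts[word] - DISCOUNT, 0.0) / total
    gamma = DISCOUNT * len(counts) / total
    return alpha, gamma


vocab = sorted(unigrams)
total_prob = sum(smoothed_score(w, ("a",), alpha_gamma, unigram_score) for w in vocab)
print(total_prob)
```

With this split, swapping in a different smoothing method only means replacing `alpha_gamma` and `unigram_score`. The recursion itself stays the same.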

Interpolated KN implementation

NLTK's Interpolated KN does not handle unknown words, so keep in mind that the NLTK model returns a log score of -inf for an unknown word.

### Model setup ###
import nltk
from nltk.lm.models import KneserNeyInterpolated
from nltk.corpus import brown

from nltk.lm.counter import NgramCounter
from nltk.lm.vocabulary import Vocabulary
from nltk.lm.preprocessing import padded_everygram_pipeline

# nltk.download('brown')
# refer to https://www.nltk.org/howto/lm.html#issue-167
train_data, vocab_data = padded_everygram_pipeline(2, brown.sents(categories="news"))
model = KneserNeyInterpolated(order=2)
model.fit(train_data, vocab_data)
### End of model setup ###

# model.estimator is an instance of `nltk.lm.smoothing.KneserNey`
# "ほげ" ("hoge") is a placeholder word that does not occur in the corpus
print("a", model.estimator.unigram_score("a"))
print("ほげ", model.estimator.unigram_score("ほげ"))
#  a 0.008209501384160118
# ほげ 0.0

# The score of "ほげ a" falls back to the unigram_score of "a"
model.unmasked_score("a", ["ほげ"])

# The score of "a ほげ" is -inf because "ほげ" is an unknown word
model.logscore("ほげ", ["a"])
# -inf
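To show why this needs care, here is a small pure-Python sketch of how a single zero probability turns a whole sentence score into -inf. It does not call NLTK. The helpers `log2_score`, `sentence_logscore`, and `split_known_unknown`, the toy vocabulary, and the probabilities are all hypothetical and only there for illustration.

```python
import math


def log2_score(prob):
    # Map a zero probability to -inf instead of letting math.log2 raise ValueError.
    return float("-inf") if prob == 0.0 else math.log2(prob)


def sentence_logscore(probs):
    # A sentence score is the sum of the per-word log scores.
    return sum(log2_score(p) for p in probs)


def split_known_unknown(words, vocab):
    # Separate in-vocabulary words from unknown ones before scoring.
    known = [w for w in words if w in vocab]
    unknown = [w for w in words if w not in vocab]
    return known, unknown


vocab = {"a", "b", "c"}   # hypothetical toy vocabulary
probs = [0.5, 0.0, 0.25]  # hypothetical per-word probabilities; 0.0 stands for the unknown word

total = sentence_logscore(probs)
known, unknown = split_known_unknown(["a", "ほげ"], vocab)
print(total, known, unknown)
```

Because -inf absorbs everything added to it, one unknown word is enough to make the sentence score (and any perplexity computed from it) degenerate. Checking for unknown words before scoring is one way to notice this early.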

In addition, when computing $P_{continuation}$, the implementation does not cache the preceding contexts (the $w_{i-1}$ that occur before $w_i$) and searches for them on every call, so it runs very slowly.

higher_order_ngrams_with_context = (
       counts
       for prefix_ngram, counts in self.counts[len(context) + 2].items()
       if prefix_ngram[1:] == context
)
# https://github.com/nltk/nltk/blob/98a3a123a554ac9765475aee0c6f1e77ca1723da/nltk/lm/smoothing.py#L118-L123
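As a rough idea of what caching could look like, the sketch below compares a per-call scan with a one-time precomputation for the bigram case. It is my own illustration, not NLTK code and not a patch for it: `naive_continuation_count`, `build_continuation_cache`, and the toy counts are hypothetical names and data. For higher orders the cache key would also need to include the context.

```python
from collections import Counter, defaultdict


def naive_continuation_count(word, bigram_counts):
    # Scans every context on each call, like the generator expression above.
    return sum(1 for followers in bigram_counts.values() if followers[word] > 0)


def build_continuation_cache(bigram_counts):
    # One pass over all bigram types; later lookups are plain dictionary reads.
    cache = Counter()
    for followers in bigram_counts.values():
        for word, count in followers.items():
            if count > 0:
                cache[word] += 1
    return cache


# Hypothetical toy counts: context word -> Counter of following words.
tokens = "a b a c a b".split()
bigram_counts = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    bigram_counts[prev][cur] += 1

cache = build_continuation_cache(bigram_counts)
total_bigram_types = sum(cache.values())

# P_continuation(w) = (number of distinct words preceding w) / (number of bigram types)
p_continuation = {w: cache[w] / total_bigram_types for w in cache}
print(p_continuation)
```

The scan costs one pass over all contexts per query, while the cache pays that cost once when it is built and must be rebuilt whenever the counts change.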

(This pays off the foreshadowing from the previous post.)

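Only KneserNeyInterpolated was abnormally slow. Given how Kneser-Ney smoothing works and how it is implemented, that is hardly surprising, but I plan to write it up in a separate blog post.
Summarizing n-gram language models (NLTK's n-gram language models) - エイエイレトリック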