Implementing a Bayes classifier

Reference URL

What is a naive Bayes classifier?

Given a document D, the probability that it belongs to category C can be written with Bayes' theorem as follows.

${ \Large P(C|D) = \frac{ P(D|C) P(C) }{ P(D) } }$

In the expression above:

P(D) does not depend on the category, so it can be dropped without changing which category has the largest P(C|D).

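P(C) is the prior probability of category C, estimated as the fraction of training documents labeled C:

${ \Large P(C) = \frac{ \text{number of documents labeled } C }{ \text{total number of documents} } }$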

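P(D|C) is the probability of obtaining document D given category C:

${ \Large P(D|C) = \frac{ \text{number of occurrences of document } D }{ \text{number of documents in category } C } }$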

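In practice, however, the following approximation is used. It treats the words W_1, ..., W_n of D as independent given C, which is the "naive" part of the name.

${ \Large P(D|C) \approx P(W_1|C) \cdot P(W_2|C) \cdots P(W_n|C) }$

${ \Large P(W_i|C) = \frac{ \text{occurrences of word } W_i \text{ in category } C }{ \text{total word occurrences in category } C } }$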
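
Putting the definitions above together, the classification rule can be restated in one line. This is only a summary of the preceding formulas; the symbol $\hat{C}$ is notation introduced here for the chosen category and does not appear in the original text.

${ \Large \hat{C} = \arg\max_{C} P(C|D) = \arg\max_{C} P(C) \prod_{i=1}^{n} P(W_i|C) }$

Because P(D) is the same for every category, leaving it out does not change which C attains the maximum.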

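Example: estimating the category of a document

Given the training data below, estimate the category of the document "ball sports" (the two words "ball" and "sports").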

No.	Category	Words
1	Soccer	ball, sports, World Cup, ball
2	Baseball	ball, sports, glove, bat
3	Tennis	ball, racket, court
4	Soccer	ball, sports

Computing P(C) (the probability of each category)

-	Soccer	Baseball	Tennis
P(C)	2/4	1/4	1/4

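Computing P(D|C) (the probability of obtaining document D in a given category)

First, count the words in each category (repeated words are counted every time they occur).

-	Soccer	Baseball	Tennis
Word count	6	4	3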

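Next, compute how often each word occurs within each category. Since the document to classify is "ball sports", only "ball" and "sports" are needed.

P(Wi|C)	Soccer	Baseball	Tennis
ball	3/6	1/4	1/3
sports	2/6	1/4	0/3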
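
The counting step above can be sketched in Python as follows. This is an illustration only: the category and word names are English renderings of the training table, and `word_counts` and `word_likelihood` are names made up for this sketch. `Fraction` prints values in lowest terms, so 3/6 is shown as 1/2.

```python
from collections import Counter
from fractions import Fraction

# Training data from the example table: (category, words in the document)
training = [
    ("soccer", ["ball", "sports", "world cup", "ball"]),
    ("baseball", ["ball", "sports", "glove", "bat"]),
    ("tennis", ["ball", "racket", "court"]),
    ("soccer", ["ball", "sports"]),
]

# Pool the words of every document in a category, keeping duplicates
word_counts = {}
for category, words in training:
    word_counts.setdefault(category, Counter()).update(words)


def word_likelihood(word, category):
    """P(W|C): occurrences of word in category / all word occurrences there."""
    counts = word_counts[category]
    return Fraction(counts[word], sum(counts.values()))


for word in ("ball", "sports"):
    row = [str(word_likelihood(word, c)) for c in ("soccer", "baseball", "tennis")]
    print(word, row)
```

A `Counter` returns 0 for a word it has never seen, which is why "sports" in the tennis category comes out as 0 without any special handling.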

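Multiplying the two word probabilities for each category gives:

-	Soccer	Baseball	Tennis
P(D|C)	6/36	1/16	0/9

With P(D) dropped, P(C|D) is proportional to P(C) P(D|C), so

-	Soccer	Baseball	Tennis
P(C|D)	(2/4)x(6/36) = 1/12	(1/4)x(1/16) = 1/64	(1/4)x(0/9) = 0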

From the above, the document "ball sports" most likely belongs to the Soccer category.
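
As a cross-check of the final step, here is a minimal sketch that multiplies the prior by the per-word probabilities and picks the largest score. The numbers are copied from the P(C) and P(Wi|C) tables of this example; the names `prior`, `likelihood` and `posterior_scores` are made up for the sketch.

```python
from fractions import Fraction

# P(C) for each category, as in the example's table
prior = {
    "soccer": Fraction(2, 4),
    "baseball": Fraction(1, 4),
    "tennis": Fraction(1, 4),
}

# P(W|C) for the two words of the document, as in the example's table
likelihood = {
    "ball": {"soccer": Fraction(3, 6), "baseball": Fraction(1, 4), "tennis": Fraction(1, 3)},
    "sports": {"soccer": Fraction(2, 6), "baseball": Fraction(1, 4), "tennis": Fraction(0, 3)},
}


def posterior_scores(document):
    """Return P(C) * product of P(W|C) over the document's words, per category."""
    scores = {}
    for category, p_c in prior.items():
        score = p_c
        for word in document:
            score *= likelihood[word][category]
        scores[category] = score
    return scores


scores = posterior_scores(["ball", "sports"])
best = max(scores, key=scores.get)
print(best, scores[best])
```

The scores are unnormalized because P(D) was dropped, so they are only meaningful relative to each other.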

Image classification with a naive Bayes classifier in Python

As usual, this is a hand-typed copy (写経) of the 実践コンピュータビジョン sample programs. This time it is chapter 8.

The training and test data are points_normal.pkl and points_normal_test.pkl from an earlier entry: pythonでのk近傍法( k-nearest neighbor algorithm, k-NN ) - end0tknr's kipple - 新web写経開発

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import matplotlib
matplotlib.use('Agg')
import matplotlib.pylab as plb
import numpy
import pickle


def main():
    # Load the training data (pickle files must be opened in binary mode)
    with open('points_normal.pkl', 'rb') as f:
#    with open('points_ring.pkl', 'rb') as f:
        class_1 = pickle.load(f)
        class_2 = pickle.load(f)
        labels = pickle.load(f)

    # Train the Bayes classifier
    bc = BayesClassifier()
    bc.train([class_1, class_2], [1, -1])

    # Load the test data
    with open('points_normal_test.pkl', 'rb') as f:
#    with open('points_ring_test.pkl', 'rb') as f:
        class_1 = pickle.load(f)
        class_2 = pickle.load(f)
        labels = pickle.load(f)

    # Classify the first few test points and print the estimated labels
    print(bc.classify(class_1[:10])[0])

    # Plot the points and the decision boundary
    def classify0(x, y, bc=bc):
        points = numpy.vstack((x, y))
        return bc.classify(points.T)[0]
    plot_2D_boundary([-6, 6, -6, 6], [class_1, class_2], classify0, [1, -1])

    plb.savefig('8_2_bayes_n.png')


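# Evaluate a d-dimensional Gaussian with independent per-dimension
# mean m and variance v at the points given as the rows of x
def gauss(m, v, x):
    # Treat a single point as one row so that numpy.diag() below
    # always receives a 2-D matrix
    x = numpy.atleast_2d(x)
    d = x.shape[1]

    # Inverse of the diagonal covariance matrix; subtract the mean from x
    S = numpy.diag(1 / v)  # diag(): build a diagonal matrix from a vector
    x = x - m
    # Product of the per-dimension probabilities
    # exp(): e to the given power, dot(): matrix product
    y = numpy.exp(-0.5 * numpy.diag(numpy.dot(x, numpy.dot(S, x.T))))
    # Normalize and return; prod(): product of the array elements
    return y * (2 * numpy.pi)**(-d / 2.0) / (numpy.sqrt(numpy.prod(v)) + 1e-6)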

class BayesClassifier(object):
    def __init__(self):
        self.labels = []  # class labels
        self.mean = []    # per-class means
        self.var = []     # per-class variances
        self.n = 0        # number of classes

    # Train on data (a list of n*dim arrays, one array per class).
    # labels is optional and defaults to 0...n-1
    def train(self, data, labels=None):
        if labels is None:
            labels = range(len(data))
        self.labels = labels
        self.n = len(labels)
        for c in data:
            self.mean.append(numpy.mean(c, axis=0))
            self.var.append(numpy.var(c, axis=0))  # var(): variance

    # Classify points by evaluating each class's density and
    # returning the label of the most probable class
    def classify(self, points):
        # Evaluate the density of every class at every point
        est_prob = numpy.array(
            [gauss(m, v, points) for m, v in zip(self.mean, self.var)])
        # Find the index of the most probable class and map it to its label
        ndx = est_prob.argmax(axis=0)
        est_labels = numpy.array([self.labels[n] for n in ndx])
        return est_labels, est_prob


# plot_range: (xmin, xmax, ymin, ymax); points: list of point arrays, one per class;
# decisionfcn: function that returns a label for each (x, y);
# labels: list of the labels decisionfcn returns for each class;
# values: list of decision contour levels to draw
def plot_2D_boundary(plot_range, points, decisionfcn, labels, values=[0]):
    clist = ['b', 'r', 'g', 'k', 'm', 'y']  # plot color for each class
    # Evaluate the decision function on a grid
    x = numpy.arange(plot_range[0], plot_range[1], .1)
    y = numpy.arange(plot_range[2], plot_range[3], .1)
    xx, yy = numpy.meshgrid(x, y)
    xxx, yyy = xx.flatten(), yy.flatten()  # x and y coordinates of the grid points
    zz = numpy.array(decisionfcn(xxx, yyy))
    zz = zz.reshape(xx.shape)
    # Draw the contours at the levels given in values
    plb.contour(xx, yy, zz, values)
    # For each class, mark correctly classified points with '*'
    # and misclassified points with 'o'
    for i in range(len(points)):
        d = decisionfcn(points[i][:, 0], points[i][:, 1])
        correct_ndx = labels[i] == d
        incorrect_ndx = labels[i] != d
        plb.plot(
            points[i][correct_ndx, 0],
            points[i][correct_ndx, 1],
            '*',
            color=clist[i])
        plb.plot(
            points[i][incorrect_ndx, 0],
            points[i][incorrect_ndx, 1],
            'o',
            color=clist[i])
    plb.axis('equal')

if __name__ == '__main__':
    main()
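
The gauss function in the listing uses a diagonal covariance matrix, so each dimension is treated as independent. This is the same "naive" assumption as the product of word probabilities in the first half. The sketch below makes that explicit by multiplying one 1-D normal density per dimension. It is not the book's code: `gauss_diag` is a hypothetical helper, the means and variances are made-up values rather than the contents of the pickle files, and the small `1e-6` term from the listing is left out.

```python
import numpy as np


def gauss_diag(m, v, x):
    """Axis-aligned Gaussian density at each row of x (hypothetical helper).

    m, v: per-dimension mean and variance, shape (d,)
    x:    points of shape (n, d), or a single point of shape (d,)
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    # One 1-D normal density per point and per dimension, shape (n, d)
    per_dim = np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
    # Independence across dimensions: multiply along each row
    return per_dim.prod(axis=1)


# Two classes described by assumed (made-up) means and variances
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
labels = [1, -1]

points = np.array([[0.2, -0.1], [2.8, 3.1]])
# Density of every class at every point, shape (number of classes, n)
densities = np.array([gauss_diag(m, v, points) for m, v in zip(means, variances)])
# For each point, take the label of the class with the highest density
predicted = [labels[i] for i in densities.argmax(axis=0)]
print(predicted)
```

Computing the per-dimension densities directly also avoids building the n x n matrix that `numpy.dot(x, numpy.dot(S, x.T))` creates before its diagonal is taken.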