Previously, I created a Sudachi user dictionary. This time, I use that same user dictionary to perform morphological analysis in a browser environment, via kuromoji.js.
Reference URLs
- Creating a Sudachi user dictionary with sudachipy for Python (miniconda for win) - end0tknr's kipple - web写経開発
- Using a Sudachi user dictionary for morphological analysis with sudachipy for Python - end0tknr's kipple - web写経開発
- Morphological analysis with kuromoji.js + SudachiDict (building the dictionary; comparison with IPADic and UniDic) #JavaScript - Qiita
Preparation 1 - Installing node.js v10.16.3
https://github.com/coreybutler/nvm-windows/releases/download/1.1.12/nvm-setup.exe
The nvm for Windows installer is available at the URL above. After installing it, run the following commands:
DOS> nvm --version
1.1.12
DOS> nvm install v10.16.3
DOS> nvm use 10.16.3
Now using node v10.16.3 (64-bit)
DOS> node --version
v10.16.3
Preparation 2 - Installing kuromoji.js
DOS> npm install kuromoji
<snip>
+ kuromoji@0.1.2
DOS> cd node_modules/kuromoji
DOS> npm install
With kuromoji installed as above (the second npm install, run inside node_modules/kuromoji, pulls in the build-time dependencies such as mecab-ipadic-seed that npm run build-dict needs), the dictionary files can be built as follows:
DOS> npm run build-dict
DOS> dir dict
 Volume in drive C has no label.
 Volume Serial Number is 00DD-2D3B
2024/09/18  09:09    <DIR>          .
2024/09/18  09:05    <DIR>          ..
2024/09/18  09:08        55,873,342 base.dat.gz
2024/09/18  09:08        25,614,933 cc.dat.gz
2024/09/18  09:08        48,926,448 check.dat.gz
2024/09/18  09:08        12,272,244 tid.dat.gz
2024/09/18  09:09        10,529,545 tid_map.dat.gz
2024/09/18  09:09        36,804,066 tid_pos.dat.gz
2024/09/18  09:09            10,491 unk.dat.gz
2024/09/18  09:09               320 unk_char.dat.gz
2024/09/18  09:09               338 unk_compat.dat.gz
2024/09/18  09:09             1,141 unk_invoke.dat.gz
2024/09/18  09:09             1,177 unk_map.dat.gz
2024/09/18  09:09            10,524 unk_pos.dat.gz
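Incidentally, if you drive this build from a script, the output can be sanity-checked with a few lines of Python. This is only a minimal sketch; the twelve .dat.gz file names come from the listing above, and DICT_DIR is an assumption you should adjust to your own tree.

# sanity-check the kuromoji.js dictionary build output
# (file names taken from the dir listing above;
#  DICT_DIR is an assumed path, adjust to your checkout)
import os

DICT_DIR = "node_modules/kuromoji/dict"

EXPECTED = ["base.dat.gz", "cc.dat.gz", "check.dat.gz",
            "tid.dat.gz", "tid_map.dat.gz", "tid_pos.dat.gz",
            "unk.dat.gz", "unk_char.dat.gz", "unk_compat.dat.gz",
            "unk_invoke.dat.gz", "unk_map.dat.gz", "unk_pos.dat.gz"]

for name in EXPECTED:
    path = os.path.join(DICT_DIR, name)
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        print("{:>12,} bytes  {}".format(os.path.getsize(path), name))
    else:
        print("MISSING or EMPTY:", name)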
Creating a user dictionary in the kuromoji.js + SudachiDict environment
The build procedure is as described in the reference URLs, but since it is tedious by hand, I wrapped the steps in the following Python script.
#!python
# -*- coding: utf-8 -*-
"""
Convert SudachiDict sources (plus my Sudachi user dictionary) into
mecab-ipadic-seed format and rebuild the kuromoji.js dictionary.
refer to https://qiita.com/piijey/items/2517af039bbedddec7b8
"""
from datetime import datetime
from pathlib import Path
import csv
import glob
import io
import logging.config
import os
import shutil
import subprocess
import sys
import time
import zipfile
import requests

CONF = {
    "sudachi": {
        "dic_src_base_url":
            "http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict-raw",
        # refer to https://github.com/WorksApplications/SudachiDict/blob/develop/build.gradle
        "dic_src_paths": ["matrix.def.zip",
                          "20240409/small_lex.zip",
                          "20240409/core_lex.zip",
                          # "20240409/notcore_lex.zip"
                          ],
        "dic_def_base_url":
            "https://github.com/WorksApplications/Sudachi/raw/develop/src/main/resources",
        "dic_def_paths": ["char.def", "unk.def"],
        "usrdic_dir": "c:/Users/xcendou/local/FIND_ZUMEN2/sudachi"
    },
    "kuromoji": {
        "base_dir": "C:/Users/xcendou/local/FIND_ZUMEN2/kuromoji/node_modules/kuromoji",
        # "set" and "npm run" must run in the same shell; otherwise
        # NODE_OPTIONS would not reach the build process
        "build_cmds": ["set NODE_OPTIONS=--max-old-space-size=4096 && "
                       "npm run build-dict"],
        "ipadic_src_dir":
            "c:/Users/xcendou/local/FIND_ZUMEN2/kuromoji/" +
            "node_modules/kuromoji/node_modules/mecab-ipadic-seed/lib/dict",
        "backup_dir":
            "c:/Users/xcendou/local/FIND_ZUMEN2/kuromoji/" +
            "node_modules/kuromoji/node_modules/mecab-ipadic-seed/lib/dict_bak",
    },
    "log": {
        'version': 1,
        'loggers': {
            "mainLogger": {'level': "INFO", 'handlers': ["mainHandler"]},
        },
        'handlers': {
            "mainHandler": {
                'formatter': "mainFormatter",
                'class': 'logging.handlers.RotatingFileHandler',
                'filename': os.path.splitext(os.path.basename(__file__))[0] +
                            "_" + datetime.now().strftime("%m%d") + ".log",
                'encoding': 'utf-8',
                'maxBytes': 1024 * 1024 * 10,  # 10 MB
                'backupCount': 30              # rotation
            }},
        'formatters': {
            "mainFormatter": {
                "format": "%(asctime)s\t%(levelname)s\t%(filename)s" +
                          "\tL%(lineno)d\t%(funcName)s\t%(message)s",
                "datefmt": '%Y/%m/%d %H:%M:%S'}},
    }
}

logging.config.dictConfig(CONF["log"])
logger = logging.getLogger('mainLogger')


def main():
    logger.info("START")
    init_dic_src_dir()
    download_sudachi_dic_src()
    conv_lex_csv_for_kuromoji()
    conv_sudachi_usrdic_for_kuromoji()
    build_kuromoji_dic()


def build_kuromoji_dic():
    org_dir = os.getcwd()
    os.chdir(CONF["kuromoji"]["base_dir"])
    for cmd_str in CONF["kuromoji"]["build_cmds"]:
        exec_subprocess(cmd_str)
    os.chdir(org_dir)


def conv_sudachi_usrdic_for_kuromoji():
    # rewrite my Sudachi user-dictionary CSVs into the 13-column
    # IPAdic layout, with fixed context ids / cost and a generic
    # 名詞,普通名詞 part of speech
    for org_path in glob.glob(CONF["sudachi"]["usrdic_dir"] + "/*.dic.csv"):
        with open(org_path, encoding='utf-8') as f:
            org_rows = [row for row in csv.reader(f)]

        new_path = CONF["kuromoji"]["ipadic_src_dir"] + "/" + \
            os.path.basename(org_path)
        print(new_path)

        with open(new_path, "w", encoding='utf-8') as f:
            writer = csv.writer(f, lineterminator='\n')
            for org_row in org_rows:
                new_row = [org_row[0],
                           1285, 1285, 5402,
                           "名詞", "普通名詞", "*", "*", "*", "*",
                           org_row[12],
                           "*", "*"]
                writer.writerow(new_row)


def conv_lex_csv_for_kuromoji():
    # rearrange the SudachiDict *_lex.csv columns into the 13-column
    # layout that kuromoji's mecab-ipadic-seed expects
    for lex_path in glob.glob(CONF["kuromoji"]["ipadic_src_dir"] + "/*_lex.csv"):
        with open(lex_path, encoding='utf-8') as f:
            org_rows = [row for row in csv.reader(f)]

        with open(lex_path, "w", encoding='utf-8') as f:
            writer = csv.writer(f, lineterminator='\n')
            for org_row in org_rows:
                new_row = [org_row[0], org_row[1], org_row[2], org_row[3],
                           org_row[5], org_row[6], org_row[7],
                           org_row[9], org_row[10], "*",
                           org_row[12], org_row[11], "*"]
                writer.writerow(new_row)


def download_sudachi_dic_src():
    for path in CONF["sudachi"]["dic_src_paths"]:
        req_url = CONF["sudachi"]["dic_src_base_url"] + "/" + path
        src_content = get_http_requests(req_url)
        zip_file = zipfile.ZipFile(io.BytesIO(src_content))
        zip_file.extractall(CONF["kuromoji"]["ipadic_src_dir"])

    for path in CONF["sudachi"]["dic_def_paths"]:
        req_url = CONF["sudachi"]["dic_def_base_url"] + "/" + path
        src_content = get_http_requests(req_url)
        save_path = CONF["kuromoji"]["ipadic_src_dir"] + "/" + path
        with open(save_path, 'w', encoding='utf-8') as f:
            f.write(src_content.decode("utf-8"))


def init_dic_src_dir():
    backup_dir = CONF["kuromoji"]["backup_dir"]
    if not os.path.isdir(backup_dir):
        Path(backup_dir).mkdir()

    local_src_dir = CONF["kuromoji"]["ipadic_src_dir"]
    if os.path.isdir(local_src_dir):
        # back up the previous dictionary sources, if any exist
        if len(glob.glob(local_src_dir + '/**', recursive=True)) > 0:
            bak_filename = ".".join([os.path.split(local_src_dir)[-1],
                                     datetime.now().strftime("%m%d")])
            bak_path = backup_dir + "/" + bak_filename
            shutil.make_archive(bak_path, format='zip', root_dir=local_src_dir)
        shutil.rmtree(local_src_dir)
    Path(local_src_dir).mkdir()
    return local_src_dir


def get_http_requests(req_url):
    logger.info("START %s", req_url)
    i = 0
    while i < 3:  # retry up to 3 times
        i += 1
        try:
            # some servers do not return an http response code, hence try-except
            res = requests.get(req_url, timeout=(5, 60),
                               stream=True, verify=False)
        except Exception as e:
            logger.warning(e)
            logger.warning("retry {} {}".format(i, req_url))
            time.sleep(10)
            continue
        if res.status_code == 404:
            logger.error("404 error {}".format(req_url))
            return None
        try:
            res.raise_for_status()
        except Exception as e:
            logger.warning(e)
            logger.warning("retry {} {}".format(i, req_url))
            time.sleep(10)
            continue
        # urllib.request.urlopen() could not fetch the response content,
        # probably because of its size, so read in chunks with stream=True
        chunks = []
        for chunk in res.iter_content(chunk_size=1024 * 1024):
            chunks.append(chunk)
        return b"".join(chunks)
    return None  # all retries failed


def exec_subprocess(cmd: str, raise_error=True):
    child = subprocess.Popen(cmd, shell=True,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    stdout, stderr = child.communicate()
    rt = child.returncode
    if rt != 0 and raise_error:
        print("ERROR", stderr, file=sys.stderr)
        return (None, None, None)
    return stdout, stderr, rt


if __name__ == "__main__":
    main()
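The heart of the script is the column shuffling in conv_lex_csv_for_kuromoji(): SudachiDict's lexicon CSV rows are rearranged into the 13-column IPAdic-style layout that mecab-ipadic-seed consumes. To make the mapping easier to see, here it is applied to a single row in isolation. The sample row is made up for illustration, but the index mapping is exactly the one used in the script above.

# the same SudachiDict-lex -> 13-column IPAdic-style mapping as
# conv_lex_csv_for_kuromoji(), applied to one row in isolation
def to_kuromoji_row(org_row):
    return [org_row[0], org_row[1], org_row[2], org_row[3],
            org_row[5], org_row[6], org_row[7],
            org_row[9], org_row[10], "*",
            org_row[12], org_row[11], "*"]

# made-up sample row in SudachiDict's lexicon CSV layout
sample = ["図面", "5146", "5146", "5000", "図面",
          "名詞", "普通名詞", "一般", "*", "*", "*",
          "ズメン", "図面", "*", "A", "*", "*", "*"]

print(to_kuromoji_row(sample))
# ['図面', '5146', '5146', '5000', '名詞', '普通名詞', '一般',
#  '*', '*', '*', '図面', 'ズメン', '*']

The user-dictionary conversion in conv_sudachi_usrdic_for_kuromoji() is simpler still: every entry is registered as a generic 名詞/普通名詞 with fixed context ids and cost (1285/1285/5402), so only the surface form (column 0) and column 12 of each Sudachi row carry over.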