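After obtaining the txt file, note that you may need to re-save it in the UTF-8 character set. Then run the following Ruby program to export it as a plain dictionary file (one word per line). For example, if the output file is named “my_sougou.dict”, run “ruby sougou.rb > my_sougou.dict” (assuming the Ruby file is saved as “sougou.rb”):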
# -*- coding: utf-8 -*-
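A minimal sketch of such a conversion script follows. It is not the original program: it assumes that each line of the exported txt contains the word as its first run of Chinese characters, possibly followed by other fields such as pinyin or a frequency, and the helper name `extract_words` is hypothetical. Adjust the pattern if your export uses a different layout.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper (not the original script): keep the first run of
# Han characters on each line, dropping duplicates and lines without any.
def extract_words(lines)
  lines.map { |line| line[/\p{Han}+/] }.compact.uniq
end

# Assumed usage: ruby sougou.rb exported.txt > my_sougou.dict
if ARGV[0] && File.file?(ARGV[0])
  puts extract_words(File.readlines(ARGV[0], encoding: "UTF-8"))
end
```

Matching on `\p{Han}` rather than splitting on a fixed separator keeps the sketch independent of the exact column layout, at the cost of ignoring custom words that contain Latin letters or digits.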
Copy “my_sougou.dict” to the “$RSEG_HOME/dict” directory;
Modify “$RSEG_HOME/lib/builder/dict.rb” by adding “my_sougou.dict” to the array of dictionary files, for example:
dictionaries = ['cedict.zh_CN.utf8', 'wikipedia.zh.utf8', 'my_sougou.dict']
Run this “dict.rb”; the “$RSEG_HOME/dict/dict.hash” file should then be regenerated. This file is the word base that rseg uses for segmentation at runtime;
Copy this “dict.hash” over the one in your local “rseg-0.1.7” gem directory. In my local environment, that is “~/.rvm/gems/ruby-1.9.2-p180/gems/rseg-0.1.7/dict/dict.hash”;
Then run the following Ruby program. It should now segment using a word base that includes the words from “my_sougou.dict”:
# -*- coding: utf-8 -*-
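The two copy steps in the list above can also be scripted. The sketch below uses only Ruby's standard `FileUtils`; the helper names `install_word_list` and `deploy_dict_hash` are hypothetical, and it assumes that both the rseg source tree and the installed gem keep their dictionaries in a “dict” subdirectory, as the paths above suggest.

```ruby
require "fileutils"

# Hypothetical helper: copy the custom word list into the rseg source tree.
def install_word_list(word_list, rseg_home)
  FileUtils.cp(word_list, File.join(rseg_home, "dict"))
end

# Hypothetical helper: overwrite the installed gem's dict.hash with the
# one regenerated in the source tree.
def deploy_dict_hash(rseg_home, gem_dir)
  FileUtils.cp(File.join(rseg_home, "dict", "dict.hash"),
               File.join(gem_dir, "dict", "dict.hash"))
end
```

In this sketch you would call `install_word_list("my_sougou.dict", ENV["RSEG_HOME"])` before running “dict.rb”, and `deploy_dict_hash` afterwards with the gem directory of your own environment as the second argument.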
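The above is how to export custom words from the Sogou Pinyin input method. The next goal is to work out how to export the custom-word dictionary of the Ubuntu pinyin input method and add it to the rseg segmentation dictionary, since over the past two years my everyday system has switched to Ubuntu. This can serve as a starting point.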