Converting custom words from the “Sogou Pinyin” input method into an rseg segmentation dictionary

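In yesterday's post I described how to set up rseg-0.1.7 under Ruby 1.9. Today it occurred to me that if the words I coin myself while typing could become part of the segmentation dictionary, it would help a great deal when running full-text search over my own articles. Since the input method I have used most is Sogou Pinyin, I start with it.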

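In the Sogou input method settings, find the “Export dictionary to txt file” option and run it;
Once you have the txt file, note that you may need to re-save it as UTF-8. Then run the following Ruby program to export it as a plain word list (one word per line). For example, if the output file is to be named “my_sougou.dict”, run “ruby sougou.rb > my_sougou.dict” (assuming the Ruby file is saved as “sougou.rb”):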

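# -*- coding: utf-8 -*-
# Print every custom word of two or more characters, one per line.
File.open('/home/user/sougou_dict.utf8', 'r:UTF-8') do |file|
  file.each_line do |line|
    word = line.split.last # the word is the last whitespace-separated field
    next if word.nil?      # skip blank lines instead of failing on nil
    puts word if word.length > 1
  end
end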
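
If the export cannot conveniently be re-saved as UTF-8 by hand, the conversion and the extraction can be combined. The following is only a sketch: the helper name extract_words is hypothetical, the default source encoding GB18030 is an assumption (check what your export actually uses), and, like the program above, it assumes the word is the last whitespace-separated field on each line. It also drops duplicate words.

```ruby
# -*- coding: utf-8 -*-
require 'set'

# Hypothetical helper: read a Sogou export, transcode it to UTF-8 and
# return the custom words (two or more characters), without duplicates.
# source_encoding is an assumption; pass the encoding your export uses.
def extract_words(path, source_encoding = 'GB18030')
  seen = Set.new
  words = []
  # "r:EXTERNAL:INTERNAL" makes Ruby transcode each line while reading.
  File.open(path, "r:#{source_encoding}:UTF-8") do |file|
    file.each_line do |line|
      word = line.split.last
      next if word.nil? || word.length < 2 # blank line or single character
      words << word if seen.add?(word)     # add? returns nil for duplicates
    end
  end
  words
end

# Usage (path is an example):
#   puts extract_words('/home/user/sougou_dict.txt')
```

Saved as a script with the usage line uncommented, it can be redirected to “my_sougou.dict” in the same way as above.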

Get the rseg source code from here;
Copy “my_sougou.dict” to “$RSEG_HOME/dict” directory;
Modify “$RSEG_HOME/lib/builder/dict.rb” by adding “my_sougou.dict” to the array of dictionary files, for example:

dictionaries = ['cedict.zh_CN.utf8', 'wikipedia.zh.utf8', 'my_sougou.dict']

Run this “dict.rb”; the “$RSEG_HOME/dict/dict.hash” file should then be regenerated. This file is the word base that rseg uses for segmentation at runtime;
Copy this “dict.hash” over the one in your local “rseg-0.1.7” gem directory; in my environment that is “~/.rvm/gems/ruby-1.9.2-p180/gems/rseg-0.1.7/dict/dict.hash”;
Then run the following Ruby program; it should now segment using a word base that includes the words from “my_sougou.dict”:


# -*- coding: utf-8 -*-
require 'rseg'
puts Rseg.segment("这里写一些你在搜狐拼音输入法中自己定义的词，应该可以被正确分词。")
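
The copy-and-overwrite step above can also be scripted. This is a minimal sketch rather than part of rseg: install_dict_hash is a hypothetical helper, it assumes the dict/dict.hash layout described in the steps, and it keeps a “.bak” copy of the file it replaces so the original word base can be restored.

```ruby
# -*- coding: utf-8 -*-
require 'fileutils'

# Hypothetical helper: copy the regenerated dict.hash from the rseg source
# tree into the installed gem directory, backing up the existing file first.
def install_dict_hash(rseg_home, gem_dir)
  source = File.join(rseg_home, 'dict', 'dict.hash')
  target = File.join(gem_dir, 'dict', 'dict.hash')
  raise ArgumentError, "missing #{source}" unless File.file?(source)

  FileUtils.mkdir_p(File.dirname(target))
  # Keep the previous word base so the change can be undone.
  FileUtils.cp(target, target + '.bak') if File.file?(target)
  FileUtils.cp(source, target)
  target
end

# Usage (the gem path is the one from my environment; adjust to yours):
#   install_dict_hash(ENV['RSEG_HOME'],
#     File.expand_path('~/.rvm/gems/ruby-1.9.2-p180/gems/rseg-0.1.7'))
```

With a recent enough RubyGems, Gem::Specification.find_by_name('rseg').gem_dir may be a way to look up the gem directory instead of hard-coding it; I have not checked this against the RubyGems shipped with ruby-1.9.2.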

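That is how to export the custom words of the Sogou Pinyin input method. The next goal is to work out how to export the custom-word dictionary of the Ubuntu pinyin input method and add it to the rseg dictionary, since my daily system has switched to Ubuntu over the past two years. This may serve as a starting point.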

Gangmax Blog
