Converting custom words from "Ubuntu ibus Pinyin 1.3.0" into an rseg segmentation dictionary

I spent some time today sorting this problem out. Here are the steps:

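First, find where "Ubuntu ibus Pinyin 1.3.0" keeps its custom dictionary. A Google search showed that it is located at "~/.cache/ibus/pinyin/user-1.3.db". Its table structure is simple: two-character words are stored in the "py_phrase_1" table, three-character words in the "py_phrase_2" table, and so on. The longest words in my custom dictionary have seven characters, so the relevant tables run from "py_phrase_1" (two characters) to "py_phrase_6" (seven characters). Use "Sqliteman" to export these six tables to CSV files named 2.csv, 3.csv, …, 7.csv.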
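The relationship between word length, table name, and CSV file name described above can be sketched in Ruby. This is only an illustration: the helper names `phrase_table` and `csv_name` are hypothetical, and the file names simply follow the convention used in this post.

```ruby
# Hypothetical helpers (not part of ibus-pinyin or rseg).

# Map a word length in characters to the table that stores such words,
# following the layout described above: 2 characters -> py_phrase_1, etc.
def phrase_table(word_length)
  raise ArgumentError, 'word length must be at least 2' if word_length < 2
  "py_phrase_#{word_length - 1}"
end

# Map a word length to the CSV file name used in this post.
def csv_name(word_length)
  "#{word_length}.csv"
end

# List the six tables to export and the file each one goes to.
(2..7).each do |n|
  puts "#{phrase_table(n)} -> #{csv_name(n)}"
end
```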
Next, write a program that extracts just the words from these files. The program is as follows:

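# -*- coding: utf-8 -*-
#--
# Parse lines like
# "14", "网络", "8964", "19", "27", "10", "55"
# and print the second field (the phrase) without its quotes.
#++

ARGV.each do |arg|
  # CSV files are looked up relative to this script's directory.
  File.open(File.join(File.dirname(__FILE__), arg), 'r:UTF-8') do |file|
    file.each_line do |line|
      # Second whitespace-separated field, e.g. "网络",
      field = line.split[1]
      next if field.nil?
      # String#index returns nil (not -1) when the quote is not found.
      start_index = field.index('"')
      next if start_index.nil?
      end_index = field.index('"', start_index + 1)
      next if end_index.nil?
      # String#[] takes (start, length), so pass the length between the quotes.
      word = field[start_index + 1, end_index - start_index - 1]
      # Exclude the header line: "user_freq", "phrase", "freq", "s0", "y0", "s1", "y1"
      puts word unless word == "phrase"
    end
  end
end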

Assuming the program is saved as export.rb, run:

ruby export.rb 2.csv 3.csv 4.csv 5.csv 6.csv 7.csv > ubuntu_words.txt
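As an alternative to splitting on whitespace, the phrase field can be pulled out with a regular expression. The following is a minimal sketch, assuming every exported line has the quoted, comma-separated shape shown in the program's header comment; `extract_phrase` is a hypothetical helper name, not something from the original script.

```ruby
# encoding: utf-8

# Hypothetical helper: return the second double-quoted field of an
# exported line, or nil when the line has fewer than two quoted fields.
def extract_phrase(line)
  fields = line.scan(/"([^"]*)"/).flatten
  fields[1]
end

sample = '"14", "网络", "8964", "19", "27", "10", "55"'
header = '"user_freq", "phrase", "freq", "s0", "y0", "s1", "y1"'

# Print the phrase of each line, skipping the header row.
[sample, header].each do |line|
  word = extract_phrase(line)
  puts word unless word.nil? || word == 'phrase'
end
```

Unlike the whitespace split, this version also copes with phrases that contain spaces, since it only looks at the quote characters.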

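Get the rseg source code from "here". Copy the "ubuntu_words.txt" file produced in the previous step into the "dict" directory of your local working copy.
Edit "lib/builder/dict.rb" and add the new dictionary file to the list: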

dictionaries = ['cedict.zh_CN.utf8', 'wikipedia.zh.utf8', 'ubuntu_words.txt']

Run "lib/builder/dict.rb":

cd rseg/lib/builder  # the builder directory inside the local rseg working copy obtained above
ruby ./dict.rb

The "dict/dict.hash" file is then regenerated from the new set of dictionary files.

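Copy the new "dict.hash" file into the "dict" directory of the rseg gem in your local gem repository, overwriting the file of the same name.
When rseg runs now, its segmentation dictionary includes the custom words from the ibus Pinyin input method.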
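The copy into the installed gem can also be scripted. The sketch below is an assumption-laden illustration rather than part of rseg: `install_dict` is a hypothetical helper, and the usage comment assumes the rseg gem is installed and that the script runs from the root of the local rseg working copy.

```ruby
require 'fileutils'

# Hypothetical helper: copy a rebuilt dict.hash into the "dict" directory
# under gem_dir, overwriting any existing file, and return the target path.
def install_dict(new_dict_path, gem_dir)
  target = File.join(gem_dir, 'dict', 'dict.hash')
  # The directory normally exists already; mkdir_p is a no-op in that case.
  FileUtils.mkdir_p(File.dirname(target))
  FileUtils.cp(new_dict_path, target)
  target
end

# Usage (assumptions: rseg is installed; current directory is the rseg
# working copy root, so the rebuilt file is at dict/dict.hash):
#   gem_dir = Gem::Specification.find_by_name('rseg').gem_dir
#   install_dict('dict/dict.hash', gem_dir)
```

Looking up the gem directory through RubyGems avoids hard-coding a version-specific install path.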

Gangmax Blog
