Chinese Word Vector Modeling with Gensim
Alex
- Natural language processing
Install Gensim
pip install gensim
- Install the word segmenter
pip install jieba
- Build the word2vec model
Exercise: Chinese modeling with the Wikipedia corpus
Download the wiki corpus
Extract text from the XML
- Use Wikipedia Extractor to extract the article text
git clone https://github.com/attardi/wikiextractor.git wikiextractor
cd wikiextractor
python setup.py install
./WikiExtractor.py -b 500M -o extracted zhwiki-latest-pages-articles.xml.bz2
### -o extracted: -o specifies the output directory; check it for the extracted files once the run finishes
### -b: size of each output file (default 1M)
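Each extracted file wraps every article in a `<doc ...>` opening line and a `</doc>` closing line, which is what the segmentation script later keys on. As a rough sketch of how that layout can be walked, the helper below yields one `(title, body)` pair per article. The name `iter_docs` and the `title="..."` attribute pattern are assumptions made for illustration, not part of the extractor itself.

```python
import re

# Assumed shape of the opening tag; only the title attribute is parsed here.
TITLE_RE = re.compile(r'title="([^"]*)"')


def iter_docs(lines):
    """Yield (title, body) for each <doc ...> ... </doc> block in `lines`."""
    title, body, inside = None, [], False
    for raw in lines:
        line = raw.strip()
        if line.startswith("<doc"):
            # Opening tag: start collecting a new article
            match = TITLE_RE.search(line)
            title = match.group(1) if match else ""
            body, inside = [], True
        elif line.startswith("</doc"):
            # Closing tag: emit the article collected so far
            if inside:
                yield title, "\n".join(body)
            inside = False
        elif inside and line:
            body.append(line)


# Hypothetical sample mimicking one extracted article
sample = [
    '<doc id="1" url="https://example.org/1" title="数学">',
    "数学",
    "",
    "数学是研究数量、结构的学科。",
    "</doc>",
]
docs = list(iter_docs(sample))
print(docs)
```

To try it on real output, pass an open file handle (opened with `encoding="utf-8"`) for one of the extracted files, such as `wiki_00`, instead of the sample list.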
Traditional-to-Simplified Chinese conversion
brew install opencc
opencc -i wiki_00 -o zh_wiki_00 -c zht2zhs.ini
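### If zht2zhs.ini raises an error, use the configuration below
opencc -i wiki_00 -o wiki_00_zh.txt -c t2s.json
Segmentation preprocessing
This step covers word segmentation, punctuation removal, and removal of the article-structure markers. (For word2vec training data it is better not to strip punctuation; in sentiment analysis, for example, punctuation is quite useful.) The result is a plain-text file of segmented text, one article per line, with words separated by spaces. script_seg.py is as follows: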
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Segment extracted wiki text with jieba: one article per output line, words separated by spaces.
import sys
import jieba.posseg as pseg

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: python script_seg.py infile outfile")
        sys.exit(1)
    i = 0
    infile, outfile = sys.argv[1:3]
    with open(infile, 'r', encoding='utf-8') as myfile, \
            open(outfile, 'w', encoding='utf-8') as output:
        for line in myfile:
            line = line.strip()
            if len(line) < 1:
                continue
            if line.startswith('<doc'):
                # Opening tag of an article: count it and skip the line
                i = i + 1
                if i % 1000 == 0:
                    print('Finished ' + str(i) + ' articles')
                continue
            if line.startswith('</doc'):
                # Closing tag: end the current article's output line
                output.write('\n')
                continue
            words = pseg.cut(line)
            for word, flag in words:
                # Flags starting with 'x' mark punctuation and other non-word tokens
                if flag.startswith('x'):
                    continue
                output.write(word + ' ')
    print('Finished ' + str(i) + ' articles')
Run the command
time python script_seg.py std_wiki_00_zh.txt seg_std_wiki_00_zh.txt
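The script above drops every token that jieba tags with a flag starting with `x`, which removes punctuation. Since the note earlier suggests keeping punctuation for tasks like sentiment analysis, here is a small sketch of making that choice switchable. It swaps in a different technique, Unicode character categories from the standard library, in place of jieba's tags, so it runs without jieba. The names `is_punctuation` and `join_tokens` and the sample tokens are hypothetical.

```python
import unicodedata


def is_punctuation(token):
    """True when every character of `token` is in a Unicode punctuation (P*) category."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def join_tokens(tokens, keep_punct=False):
    """Join segmented tokens with spaces, optionally dropping punctuation tokens."""
    kept = [
        t for t in tokens
        if t.strip() and (keep_punct or not is_punctuation(t))
    ]
    return " ".join(kept)


# Hypothetical, already-segmented input
tokens = ["这", "部", "电影", ",", "太", "好看", "了", "!"]
print(join_tokens(tokens))
print(join_tokens(tokens, keep_punct=True))
```

In the real script the same switch could be applied where the `x` flag is checked, though jieba's `x` tag also covers non-word tokens that are not punctuation, so the two filters are not identical.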
Remove blank lines
sed '/^$/d' seg_std_wiki_00_zh.txt > trim_seg_std_wiki_00_zh.txt
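For anyone without `sed` at hand, the same cleanup can be sketched in Python. Note that `sed '/^$/d'` deletes only lines that are completely empty; a line holding just spaces survives. The helper below mirrors that by default and offers a stricter mode. The name `drop_blank_lines` and its `strip_whitespace_only` flag are made up for this sketch.

```python
def drop_blank_lines(lines, strip_whitespace_only=False):
    """Yield the lines that are not blank.

    With strip_whitespace_only=False this mirrors sed '/^$/d', which deletes
    only lines that are completely empty. With True, lines made up solely of
    whitespace are dropped as well.
    """
    for line in lines:
        content = line.rstrip("\n")
        if strip_whitespace_only:
            content = content.strip()
        if content:
            yield line


# Hypothetical sample: one empty line and one whitespace-only line
sample = ["数学 是 学科 \n", "\n", "   \n", "物理 \n"]
print(list(drop_blank_lines(sample)))
print(list(drop_blank_lines(sample, strip_whitespace_only=True)))
```

The stricter mode may be the safer choice here, since an article whose tokens were all filtered out could leave a line that is not strictly empty.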
Train the word2vec model
- script_train.py
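#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Train a word2vec model on a segmented corpus (one article per line, words separated by spaces).
import sys
import logging
import multiprocessing
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: python script_train.py infile outfile")
        sys.exit(1)
    infile, outfile = sys.argv[1:3]
    # CBOW (sg=0), 400-dimensional vectors, context window of 5; words seen fewer than 5 times are dropped.
    # gensim 4.x names the dimension parameter vector_size; gensim 3.x and earlier call it size.
    model = gensim.models.Word2Vec(
        gensim.models.word2vec.LineSentence(infile),
        vector_size=400, window=5, min_count=5, sg=0,
        workers=multiprocessing.cpu_count())
    model.save(outfile)
    # Export the vectors in word2vec text format (the method lives on model.wv in gensim 1.0 and later)
    model.wv.save_word2vec_format(outfile + '.vector', binary=False)
- Run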
time python script_train.py trim_seg_std_wiki_00_zh.txt trim_seg_std_wiki_00_zh.model
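Once training finishes, the `.vector` file written next to the model is plain text: a header line with the vocabulary size and dimension, then one line per word with the word followed by its vector values. As a quick sanity check that needs nothing beyond the standard library, the sketch below parses that format and ranks words by cosine similarity. The function names and the toy vectors are invented for illustration; in practice gensim's own `Word2Vec.load` and `model.wv.most_similar` are the usual route.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vdim' per line."""
    _count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items() if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Toy vectors with made-up numbers, standing in for the real .vector file
toy = io.StringIO("3 2\n国王 1.0 0.0\n女王 0.9 0.1\n苹果 0.0 1.0\n")
vectors = load_text_vectors(toy)
print(most_similar(vectors, "国王", topn=2))
```

For the real output, open `trim_seg_std_wiki_00_zh.model.vector` with `encoding="utf-8"` and pass the handle in. Holding a full Wikipedia vocabulary in Python lists takes a lot of memory, so this is meant for spot checks only.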