Implementing LDA with Gensim is a bit more work than with JGibbLDA, but the result feels clearer, easier to understand, and therefore more flexible.
Introduction to LDA
LDA is a typical bag-of-words model: a document is treated as a collection of words, with no ordering or sequence among them. A document can cover multiple topics, and every word in the document is generated by one of those topics.
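To see what "bag of words" means in practice, here is a minimal sketch (plain Python, no gensim needed) showing that two documents with the same words in a different order get identical representations:
from collections import Counter
doc_a = ["未", "收到", "奖励"]
doc_b = ["奖励", "未", "收到"]
# word order is discarded; only the counts survive
print(Counter(doc_a) == Counter(doc_b))  # True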
The concepts you need to understand are:
- One function: the gamma function
- Two distributions: the beta distribution and the Dirichlet distribution
- One model: LDA (document-topic and topic-word distributions)
- One sampling method: Gibbs sampling
Core formula: $p(w \mid d) = \sum_{t} p(w \mid t) \, p(t \mid d)$. The probability of word $w$ appearing in document $d$ is the topic-word probability weighted by the document-topic probability, summed over all topics $t$.
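A toy numeric check of the formula, with two hypothetical topics and made-up probabilities (none of these numbers come from a trained model):
# p(w|t) and p(t|d) for two hypothetical topics t0, t1
p_w_given_t = [0.30, 0.05]
p_t_given_d = [0.20, 0.80]
# p(w|d) = sum over topics t of p(w|t) * p(t|d)
p_w_given_d = sum(pw * pt for pw, pt in zip(p_w_given_t, p_t_given_d))
print(p_w_given_d)  # 0.30*0.20 + 0.05*0.80 = 0.10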
The document generation process
- Sample document $i$'s topic distribution $\theta_i$ from a Dirichlet distribution
- Sample the topic $z_{i,j}$ of the $j$-th word in document $i$ from the multinomial distribution $\theta_i$
- Sample the word distribution $\phi_{z_{i,j}}$ of that topic from a Dirichlet distribution
- Sample the final word $w_{i,j}$ from the multinomial word distribution $\phi_{z_{i,j}}$ (a toy simulation of these four steps follows below)
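A minimal numpy simulation of the generative process. The vocabulary, document length, and hyperparameters below are arbitrary toy choices for illustration only:
import numpy as np
vocab = ["游戏", "奖励", "玩家", "等级", "充值"]
num_topics, doc_len = 2, 8
alpha, beta = 0.5, 0.1  # symmetric Dirichlet hyperparameters (toy values)
# topic-word distributions phi_t, one per topic, drawn from Dirichlet(beta)
phi = np.random.dirichlet([beta] * len(vocab), size=num_topics)
# document-topic distribution theta_i, drawn from Dirichlet(alpha)
theta = np.random.dirichlet([alpha] * num_topics)
words = []
for j in range(doc_len):
    z = np.random.choice(num_topics, p=theta)   # topic z_{i,j} ~ Multinomial(theta_i)
    w = np.random.choice(len(vocab), p=phi[z])  # word w_{i,j} ~ Multinomial(phi_{z_{i,j}})
    words.append(vocab[w])
print(words)  # one synthetic 8-word "document"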
How to choose the number of topics
- Minimize the similarity between topics
- Minimize perplexity on held-out data (see the sketch after this list)
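For the perplexity route, gensim's LdaModel exposes log_perplexity(), a per-word likelihood bound on a held-out chunk (a larger, i.e. less negative, bound means lower perplexity). A minimal sketch, assuming `texts` is the tokenized corpus built as in the implementation section below; the split point and candidate counts are arbitrary:
from gensim import corpora, models
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
train, heldout = corpus[:-100], corpus[-100:]  # hold out the last 100 docs (arbitrary split)
for k in [20, 50, 100]:
    m = models.LdaModel(train, id2word=dictionary, num_topics=k)
    print(k, m.log_perplexity(heldout))  # pick the k with the best (least negative) bound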
Python gensim implementation
# install the related python packages
$ pip install numpy
$ pip install scipy
$ pip install gensim
$ pip install jieba
from gensim import corpora, models, similarities
import logging
import jieba
# configure logging so gensim reports training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# load data from file (one document per line)
f = open('newfile.txt', 'r')
documents = f.readlines()
# tokenize with jieba
texts = [[word for word in jieba.cut(document, cut_all=False)] for document in documents]
# build the id->word mapping (the dictionary)
dictionary = corpora.Dictionary(texts)
# keep words that appear in at least 10 documents and in no more than 40% of all documents
dictionary.filter_extremes(no_below=10, no_above=0.4)
# save dictionary
dictionary.save('dict_v1.dict')
# build the bag-of-words corpus
corpus = [dictionary.doc2bow(text) for text in texts]
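Each entry of `corpus` is now a sparse bag-of-words vector, a list of (token_id, count) pairs. A quick sanity check (purely for inspection) is to map the ids of the first document back to tokens:
# inspect the first document's bag-of-words vector
print(corpus[0])  # e.g. [(0, 1), (5, 2), ...]
print([(dictionary[tid], cnt) for tid, cnt in corpus[0]])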
# initialize a model
tfidf = models.TfidfModel(corpus)
# use the model to transform vectors; wrapping the whole corpus applies TF-IDF on the fly
corpus_tfidf = tfidf[corpus]
# extract 100 LDA topics, with up to 500 inference iterations
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=100, iterations=500)
# save model to files
lda.save('mylda_v1.pkl')
# print the topic composition, and scores, for the first document;
# only a few topics are represented, the rest have near-zero scores
for index, score in sorted(lda[corpus_tfidf[0]], key=lambda tup: -tup[1]):
    print "Score: {}\t Topic: {}".format(score, lda.print_topic(index, 10))
# print the most contributing words for all 100 topics (their order is arbitrary)
lda.print_topics(100)
# load model and dictionary
model = models.LdaModel.load('mylda_v1.pkl')
dictionary = corpora.Dictionary.load('dict_v1.dict')
# predict on unseen data
query = "未收到奖励"  # "did not receive the reward"
query_bow = dictionary.doc2bow(list(jieba.cut(query, cut_all=False)))
for index, score in sorted(model[query_bow], key=lambda tup: -tup[1]):
    print "Score: {}\t Topic: {}".format(score, model.print_topic(index, 20))
# to predict on many lines of data from a file, do the following
f = open('newfile.txt', 'r')
documents = f.readlines()
texts = [[word for word in jieba.cut(document, cut_all=False)] for document in documents]
corpus = [dictionary.doc2bow(text) for text in texts]
# only print the topic with the highest score
for c in corpus:
    for index, score in sorted(model[c], key=lambda tup: -tup[1]):
        print "Score: {}\t Topic: {}".format(score, model.print_topic(index, 20))
        break  # the list is sorted, so the first entry is the highest-scoring topic
Tips
If you run into encoding problems, try the following. Add this at the beginning of your Python file:
# -*- coding: utf-8 -*-
# also, do the following (Python 2 only; setdefaultencoding does not exist in Python 3)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
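A less invasive alternative (my own suggestion, not part of the original setup) is to decode explicitly when reading the file, which works on both Python 2 and 3:
import io
# read the file as unicode directly instead of changing the default encoding
with io.open('newfile.txt', 'r', encoding='utf-8') as f:
    documents = f.readlines()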
# the following call may cause encoding problems when the topics contain Chinese characters
model.show_topics(-1, 5)
# use this instead
model.print_topics(-1, 5)
The references below walk through this kind of pipeline, with its output, step by step.
References:
- https://radimrehurek.com/gensim/tut2.html (official guide, English)
- http://blog.csdn.net/questionfish/article/details/46725475 (official guide, Chinese translation)
- https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation (LDA on the English Wikipedia)