## Basic Concepts
The vector space model (VSM) is an algebraic model that represents a text as a vector; it is also called the term vector model [1][^1]. The most straightforward choice for the vector entries is term frequency (TF).
### Term Frequency
Term frequency is simply the count of how many times each term occurs in a document.
mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']
#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

# count the raw frequency of every word within each document separately
for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print tf.items()
# Output:
# [('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
# [('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
# [('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]
The code above counts how often each word occurs in each document. This quantifies the text, but because each document produces its own dictionary, the documents cannot be compared with one another. We therefore need to represent every document as a vector of the same length, where that length is determined by the vocabulary of the whole corpus.
import string #allows for format()

def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
    print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
    doc_term_matrix.append(tf_vector)
    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int...
    # try it! type(mydoclist.index(doc) + 1)

print 'All combined, here is our master document term matrix: '
print doc_term_matrix
# Output:
# Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
# The doc is "Julie loves me more than Linda loves me"
# The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
# The doc is "Jane likes me more than Julie loves me"
# The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
# The doc is "He likes basketball more than baseball"
# The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
# All combined, here is our master document term matrix:
# [[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
Through this process, the documents have been mapped into a common vector space. If the raw term counts in one document are much higher than in the others, it can distort the analysis, so the vectors should be normalized, for example with $L_2$ normalization, which rescales each vector so that the sum of the squares of its elements is $1$.
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print 'A regular old document term matrix: '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1:'
print np.matrix(doc_term_matrix_l2)

# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])
# Output:
# A regular old document term matrix:
# [[2 0 1 0 0 2 0 1 0 1 1]
# [2 0 1 0 1 1 1 0 0 1 1]
# [0 1 0 1 1 0 0 0 1 1 1]]
# A document term matrix with row-wise L2 norms of 1:
# [[ 0.57735027 0. 0.28867513 0. 0. 0.57735027
# 0. 0.28867513 0. 0.28867513 0.28867513]
# [ 0.63245553 0. 0.31622777 0. 0.31622777 0.31622777
# 0.31622777 0. 0. 0.31622777 0.31622777]
# [ 0. 0.40824829 0. 0.40824829 0.40824829 0. 0.
# 0. 0.40824829 0.40824829 0.40824829]]
After $L_2$ normalization, every vector element lies in $[0, 1]$. If someone tried to boost a document's relevance to a topic by repeating a word over and over, normalization damps exactly this kind of frequency inflation.
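To see this damping concretely, here is a small sketch (my own addition, not from the original article) that reuses the tf() and l2_normalizer() helpers defined above on a hypothetical "spammy" document:

# hypothetical documents: one honest, one that repeats 'buy' fifty times
honest_doc = 'buy now'
spam_doc = 'buy ' * 50 + 'now'
for doc in [honest_doc, spam_doc]:
    vec = [tf(word, doc) for word in ['buy', 'now']]
    print l2_normalizer(vec)
# honest_doc -> [0.7071..., 0.7071...]
# spam_doc   -> [0.9998..., 0.0199...]   # the weight stays bounded by 1 no matter how many repeats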
### Inverse Document Frequency
Term frequency only looks at how often a word appears within one particular document; it says nothing about how informative the word is across all documents. If a word is relatively rare overall but occurs many times in a given article, it very likely reflects what that article is about, which is exactly the kind of keyword we are after. For example, suppose the words "China", "bee", and "farming" each appear the same number of times in an article. Does that mean they are equally important as keywords? Clearly not: "China" is a very common word, whereas "bee" and "farming" are much less common. If all three occur equally often in the article, it is reasonable to conclude that "bee" and "farming" matter more than "China", i.e., in a ranking of keywords, "bee" and "farming" should come before "China".
Expressed statistically, on top of term frequency each word is assigned an importance weight: the most common words get the smallest weights, moderately common words get small weights, and rare words get large weights. This weight is called the inverse document frequency (IDF), and it varies inversely with how common a word is[^2]:
\begin{equation} IDF(\mbox{word}) = \log\left(\frac{\mbox{num of documents}}{\mbox{num of documents containing word} + 1}\right) \end{equation}
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / (1. + df))

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'
# Output:
# Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
# The inverse document frequency vector is [0.000000, 0.405465, 0.000000, 0.405465, 0.000000, 0.000000, 0.405465, 0.405465, 0.405465, -0.287682, -0.287682]
The more common a word is, the larger the denominator, and the smaller its inverse document frequency, approaching 0. The 1 is added to the denominator to avoid division by zero (i.e., the case where no document contains the word). Note that with this formula a word appearing in every document gets a negative IDF, which is why 'than' and 'more' come out as -0.287682 above.
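As a quick sanity check against the output above (this arithmetic is my own addition, not the original article's), with n_samples = 3 and the natural logarithm the three possible document frequencies give exactly the values printed:

import numpy as np

# idf for the three possible document frequencies in this tiny corpus
print np.log(3 / (1. + 1))   # word in 1 document  ->  0.405465  (e.g. 'Linda')
print np.log(3 / (1. + 2))   # word in 2 documents ->  0.000000  (e.g. 'loves')
print np.log(3 / (1. + 3))   # word in 3 documents -> -0.287682  (e.g. 'than')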
### TF-IDF
The TF-IDF weight of a word in a document is simply the product of its term frequency and its inverse document frequency:

\begin{equation} \mbox{TF-IDF}(\mbox{word}) = TF(\mbox{word}) \times IDF(\mbox{word}) \end{equation}
doc_term_matrix_tfidf = []

# element-wise multiplication of each tf vector by the idf vector
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.multiply(tf_vector, my_idf_vector))

# normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at
# Output:
#
# set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])
# [[ 0. 0. 0. 0. 0. 0. 0.
# 0.70590555 0. -0.50084796 -0.50084796]
# [ 0. 0. 0. 0. 0. 0.
# 0.70590555 0. 0. -0.50084796 -0.50084796]
# [ 0. 0.49957476 0. 0.49957476 0. 0. 0.
# 0. 0.49957476 -0.35445393 -0.35445393]]
In fact, scikit-learn ships with TF-IDF built in, but because it handles the divide-by-zero problem differently (it uses a smoothed IDF), TfidfVectorizer/TfidfTransformer produce results that differ from the calculation above.
from sklearn.feature_extraction.text import CountVectorizer

# build the raw term-frequency (count) matrix
count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

# apply the (smoothed) idf weights and L2-normalize the rows
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)
tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()
# Output:
# Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
# [[ 0. 0. 0. 0. 0.28945906 0.
# 0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]
# [ 0. 0. 0. 0.41715759 0.3172591 0.3172591
# 0. 0.3172591 0.6345182 0.24637999 0.24637999]
# [ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358
# 0. 0. 0. 0.28561676 0.28561676]]
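As a side note (my own addition, not part of the original walk-through), TfidfVectorizer bundles CountVectorizer and TfidfTransformer into a single step, and its idf_ attribute exposes the smoothed IDF values scikit-learn actually uses:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=1, norm="l2")
tf_idf_matrix_onestep = tfidf_vectorizer.fit_transform(mydoclist)
print tfidf_vectorizer.idf_            # smoothed IDFs: all positive, unlike the hand-rolled values above
print tf_idf_matrix_onestep.todense()  # should match the two-step result above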
## Example Applications
TF-IDF weights in the vector space model can be used for automatic keyword extraction, finding similar documents, automatic summarization, and so on.
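For instance, "finding similar documents" reduces to comparing TF-IDF vectors. A minimal sketch (my own addition, assuming the tf_idf_matrix computed with scikit-learn above): since the rows are L2-normalized, cosine similarity is just their pairwise dot products.

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between all documents in the corpus
similarity = cosine_similarity(tf_idf_matrix)
print similarity
# Documents 1 and 2 (Julie/loves/me) should be more similar to each other
# than either is to document 3 (basketball/baseball).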
## References
- [1] G. Salton, A. Wong, and C.-S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, Nov. 1975.
## Footnotes

[^1]: The code is largely based on "The Vector Space Model of text", but the calculations in that article contain errors.
[^2]: What base is the $\log$ taken to? (The code in this post uses np.log, i.e., the natural logarithm.)