## Basic Concepts
The vector space model (VSM) is an algebraic model that represents a text as a vector; it is also called the term vector model [1][^1]. The most straightforward choice for the vector entries is term frequency (TF).
### Term Frequency
Term frequency is simply the count of how many times each term occurs in a document.
mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']
#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

# count the raw frequency of every word within each document separately
for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print tf.items()
# Output:
# [('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
# [('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
# [('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]
The code above counts how often each word occurs in each document. This quantifies the text, but because each document produces its own dictionary, the documents cannot be compared with one another. We therefore need to represent every document as a vector of the same length, where that length is determined by the vocabulary of the whole corpus.
import string #allows for format()

def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
    print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
    doc_term_matrix.append(tf_vector)
    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int...
    # try it! type(mydoclist.index(doc) + 1)

print 'All combined, here is our master document term matrix: '
print doc_term_matrix
# Output:
# Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
# The doc is "Julie loves me more than Linda loves me"
# The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
# The doc is "Jane likes me more than Julie loves me"
# The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
# The doc is "He likes basketball more than baseball"
# The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
# All combined, here is our master document term matrix:
# [[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
Through this process, the documents have been mapped into a common vector space. If the raw term counts in one document are much higher than in the others, it can distort the analysis, so the vectors should be normalized, for example with $L_2$ normalization, which rescales each vector so that the sum of the squares of its elements is $1$.
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print 'A regular old document term matrix: '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1:'
print np.matrix(doc_term_matrix_l2)

# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])
# Output:
# A regular old document term matrix:
# [[2 0 1 0 0 2 0 1 0 1 1]
# [2 0 1 0 1 1 1 0 0 1 1]
# [0 1 0 1 1 0 0 0 1 1 1]]
# A document term matrix with row-wise L2 norms of 1:
# [[ 0.57735027 0. 0.28867513 0. 0. 0.57735027
# 0. 0.28867513 0. 0.28867513 0.28867513]
# [ 0.63245553 0. 0.31622777 0. 0.31622777 0.31622777
# 0.31622777 0. 0. 0.31622777 0.31622777]
# [ 0. 0.40824829 0. 0.40824829 0.40824829 0. 0.
# 0. 0.40824829 0.40824829 0.40824829]]
After $L_2$ normalization, every vector element lies in $[0, 1]$. If someone tried to boost a document's relevance to a topic by repeating a word over and over, normalization damps exactly this kind of frequency inflation.
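To see this damping concretely, here is a small sketch (my own addition, not from the original article) that reuses the tf() and l2_normalizer() helpers defined above on a hypothetical "spammy" document:

# hypothetical documents: one honest, one that repeats 'buy' fifty times
honest_doc = 'buy now'
spam_doc = 'buy ' * 50 + 'now'
for doc in [honest_doc, spam_doc]:
    vec = [tf(word, doc) for word in ['buy', 'now']]
    print l2_normalizer(vec)
# honest_doc -> [0.7071..., 0.7071...]
# spam_doc   -> [0.9998..., 0.0199...]   # the weight stays bounded by 1 no matter how many repeats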
### Inverse Document Frequency
Term frequency only looks at how often a word appears within one particular document; it says nothing about how informative the word is across all documents. If a word is relatively rare overall but occurs many times in a given article, it very likely reflects what that article is about, which is exactly the kind of keyword we are after. For example, suppose the words "China", "bee", and "farming" each appear the same number of times in an article. Does that mean they are equally important as keywords? Clearly not: "China" is a very common word, whereas "bee" and "farming" are much less common. If all three occur equally often in the article, it is reasonable to conclude that "bee" and "farming" matter more than "China", i.e., in a ranking of keywords, "bee" and "farming" should come before "China".
Expressed statistically, on top of term frequency each word is assigned an importance weight: the most common words get the smallest weights, moderately common words get small weights, and rare words get large weights. This weight is called the inverse document frequency (IDF), and it varies inversely with how common a word is[^2]:
\begin{equation} IDF(\mbox{word}) = \log\left(\frac{\mbox{num of documents}}{\mbox{num of documents containing word} + 1}\right) \end{equation}
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / (1. + df))

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'
# Output:
# Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
# The inverse document frequency vector is [0.000000, 0.405465, 0.000000, 0.405465, 0.000000, 0.000000, 0.405465, 0.405465, 0.405465, -0.287682, -0.287682]
The more common a word is, the larger the denominator, and the smaller its inverse document frequency, approaching 0. The 1 is added to the denominator to avoid division by zero (i.e., the case where no document contains the word). Note that with this formula a word appearing in every document gets a negative IDF, which is why 'than' and 'more' come out as -0.287682 above.
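As a quick sanity check against the output above (this arithmetic is my own addition, not the original article's), with n_samples = 3 and the natural logarithm the three possible document frequencies give exactly the values printed:

import numpy as np

# idf for the three possible document frequencies in this tiny corpus
print np.log(3 / (1. + 1))   # word in 1 document  ->  0.405465  (e.g. 'Linda')
print np.log(3 / (1. + 2))   # word in 2 documents ->  0.000000  (e.g. 'loves')
print np.log(3 / (1. + 3))   # word in 3 documents -> -0.287682  (e.g. 'than')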
### TF-IDF
The TF-IDF weight of a word in a document is simply the product of its term frequency and its inverse document frequency:

\begin{equation} \mbox{TF-IDF}(\mbox{word}) = TF(\mbox{word}) \times IDF(\mbox{word}) \end{equation}
doc_term_matrix_tfidf = []

# element-wise multiplication of each tf vector by the idf vector
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.multiply(tf_vector, my_idf_vector))

# normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at
# Output:
#
# set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])
# [[ 0. 0. 0. 0. 0. 0. 0.
# 0.70590555 0. -0.50084796 -0.50084796]
# [ 0. 0. 0. 0. 0. 0.
# 0.70590555 0. 0. -0.50084796 -0.50084796]
# [ 0. 0.49957476 0. 0.49957476 0. 0. 0.
# 0. 0.49957476 -0.35445393 -0.35445393]]
In fact, scikit-learn ships with TF-IDF built in, but because it handles the divide-by-zero problem differently (it uses a smoothed IDF), TfidfVectorizer/TfidfTransformer produce results that differ from the calculation above.
from sklearn.feature_extraction.text import CountVectorizer

# build the raw term-frequency (count) matrix
count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

# apply the (smoothed) idf weights and L2-normalize the rows
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)
tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()
# Output:
# Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
# [[ 0. 0. 0. 0. 0.28945906 0.
# 0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]
# [ 0. 0. 0. 0.41715759 0.3172591 0.3172591
# 0. 0.3172591 0.6345182 0.24637999 0.24637999]
# [ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358
# 0. 0. 0. 0.28561676 0.28561676]]
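As a side note (my own addition, not part of the original walk-through), TfidfVectorizer bundles CountVectorizer and TfidfTransformer into a single step, and its idf_ attribute exposes the smoothed IDF values scikit-learn actually uses:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=1, norm="l2")
tf_idf_matrix_onestep = tfidf_vectorizer.fit_transform(mydoclist)
print tfidf_vectorizer.idf_            # smoothed IDFs: all positive, unlike the hand-rolled values above
print tf_idf_matrix_onestep.todense()  # should match the two-step result above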
## Example Applications
TF-IDF weights in the vector space model can be used for automatic keyword extraction, finding similar documents, automatic summarization, and so on.
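For instance, "finding similar documents" reduces to comparing TF-IDF vectors. A minimal sketch (my own addition, assuming the tf_idf_matrix computed with scikit-learn above): since the rows are L2-normalized, cosine similarity is just their pairwise dot products.

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between all documents in the corpus
similarity = cosine_similarity(tf_idf_matrix)
print similarity
# Documents 1 and 2 (Julie/loves/me) should be more similar to each other
# than either is to document 3 (basketball/baseball).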
## References
- [1] G. Salton, A. Wong, and C.-S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, Nov. 1975.
## Footnotes

[^1]: The code is largely based on "The Vector Space Model of text", but the calculations in that article contain errors.
[^2]: What base is the $\log$ taken to? (The code in this post uses np.log, i.e., the natural logarithm.)