


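The vector space model (VSM) is an algebraic model that represents text as vectors: each document is converted into a vector of numbers. It is also known as the term vector model [1] (see note 1). The most obvious way to build such a vector is to use term frequency (TF), that is, to count how often each word occurs in the document.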



mydoclist = ['Julie loves me more than Linda loves me',
'Jane likes me more than Julie loves me',
'He likes basketball more than baseball']

#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] +=1
    print tf.items()
# Output:
# [('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
# [('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
# [('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]


def build_lexicon(corpus):
    # collect every distinct word that appears anywhere in the corpus
    lexicon = set()
    for doc in corpus:
        lexicon.update(doc.split())
    return lexicon

def tf(term, document):
  return freq(term, document)

def freq(term, document):
  return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
    print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens?  it returns an int...
    # try it!  type(mydoclist.index(doc) + 1)
    doc_term_matrix.append(tf_vector)  # collect the row so the matrix printed below is populated

print 'All combined, here is our master document term matrix: '
print doc_term_matrix

# Output:
# Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
# The doc is "Julie loves me more than Linda loves me"
# The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
# The doc is "Jane likes me more than Julie loves me"
# The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
# The doc is "He likes basketball more than baseball"
# The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
# All combined, here is our master document term matrix: 
# [[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]


import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))  # scale each row to unit L2 length

print 'A regular old document term matrix: ' 
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1:'
print np.matrix(doc_term_matrix_l2)

# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])

# Output:

# A regular old document term matrix: 
# [[2 0 1 0 0 2 0 1 0 1 1]
#  [2 0 1 0 1 1 1 0 0 1 1]
#  [0 1 0 1 1 0 0 0 1 1 1]]

# A document term matrix with row-wise L2 norms of 1:
# [[ 0.57735027  0.          0.28867513  0.          0.          0.57735027
#    0.          0.28867513  0.          0.28867513  0.28867513]
#  [ 0.63245553  0.          0.31622777  0.          0.31622777  0.31622777
#    0.31622777  0.          0.          0.31622777  0.31622777]
#  [ 0.          0.40824829  0.          0.40824829  0.40824829  0.          0.
#    0.          0.40824829  0.40824829  0.40824829]]

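After $L_2$ normalization, every element of a vector lies in $[0, 1]$. Someone who wants to make a document look more relevant to a topic could simply repeat a word over and over; normalization dampens the effect of this kind of frequency inflation.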
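As a rough illustration of this point, the following sketch compares raw term counts with their $L_2$-normalized versions for a document that repeats one word many times. The helper names (`term_counts`, `l2_normalize`) and the toy documents are assumptions made for this example and are not part of the code above; only the standard library is used.

```python
import math
from collections import Counter

def term_counts(doc, vocab):
    """Return the raw term counts of doc, ordered by vocab."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def l2_normalize(vec):
    """Scale vec to unit Euclidean length (an all-zero vector stays all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]

vocab = ['sun', 'sky', 'bright']
honest = 'sun sky bright'
stuffed = 'sky bright ' + 'sun ' * 50   # "sun" repeated 50 times

for doc in (honest, stuffed):
    raw = term_counts(doc, vocab)
    unit = l2_normalize(raw)
    print(raw)                            # raw counts grow without bound
    print([round(x, 4) for x in unit])    # normalized weights stay within [0, 1]
```

The raw count for `sun` jumps from 1 to 50, while its normalized weight can never exceed 1.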



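In statistical terms, on top of the term frequency we assign every word an importance weight. The most common words get the smallest weight, fairly common words get a small weight, and rarer words get a larger weight. This weight is called the inverse document frequency (IDF); its size is inversely related to how common a word is (see note 2).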

\begin{equation} IDF(\mbox{word}) = \log\left(\frac{\mbox{num of documents}}{\mbox{num of documents including word}+1}\right). \end{equation}
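As a quick check of the formula on the three example documents, assume the natural logarithm (which is what `np.log` computes). The word "than" occurs in all three documents, while "Linda" occurs in only one:

\begin{equation} IDF(\mbox{than}) = \log\left(\frac{3}{3+1}\right) \approx -0.2877, \qquad IDF(\mbox{Linda}) = \log\left(\frac{3}{1+1}\right) \approx 0.4055. \end{equation}

Because of the $+1$ in the denominator, a word that appears in every document ends up with a negative weight under this definition.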

def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount +=1
    return doccount 

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / (1.+df))

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'

# Output:

# Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
# The inverse document frequency vector is [0.000000, 0.405465, 0.000000, 0.405465, 0.000000, 0.000000, 0.405465, 0.405465, 0.405465, -0.287682, -0.287682]



\begin{equation} \mbox{TF-IDF}(\mbox{word})=TF(\mbox{word})\times IDF(\mbox{word}). \end{equation}
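As a small worked example, take the first document together with the IDF values printed above: "Linda" occurs once, "me" twice, and "than" once.

\begin{equation} \mbox{TF-IDF}(\mbox{Linda}) = 1\times 0.4055 = 0.4055,\quad \mbox{TF-IDF}(\mbox{me}) = 2\times 0 = 0,\quad \mbox{TF-IDF}(\mbox{than}) = 1\times(-0.2877) = -0.2877. \end{equation}

The only other non-zero entry in that row is "more", which has the same weight as "than". Dividing by the row's $L_2$ norm, $\sqrt{0.4055^2 + 2\times 0.2877^2}\approx 0.5744$, gives roughly $0.7059$ and $-0.5008$, the values that appear in the first row of the output below.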

doc_term_matrix_tfidf = []

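# element-wise product of each tf vector with the idf vector
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.multiply(tf_vector, my_idf_vector))

# normalize each tf-idf row to unit L2 length
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
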
print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at

# Output:
# set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])
# [[ 0.          0.          0.          0.          0.          0.          0.
#    0.70590555  0.         -0.50084796 -0.50084796]
#  [ 0.          0.          0.          0.          0.          0.
#    0.70590555  0.          0.         -0.50084796 -0.50084796]
#  [ 0.          0.49957476  0.          0.49957476  0.          0.          0.
#    0.          0.49957476 -0.35445393 -0.35445393]]


from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)  # learn the IDF weights before transforming

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()

# Output:
# Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
# [[ 0.          0.          0.          0.          0.28945906  0.
#    0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
#  [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
#    0.          0.3172591   0.6345182   0.24637999  0.24637999]
#  [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
#    0.          0.          0.          0.28561676  0.28561676]]
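The scikit-learn numbers differ from the hand-rolled ones for two reasons. `CountVectorizer` lowercases tokens by default, and `TfidfTransformer`, with its default `smooth_idf=True`, uses the smoothed weight $\ln\frac{1+n}{1+df}+1$ rather than the formula used earlier. The sketch below tries to reproduce the matrix above with the standard library only. The function names are assumptions made for illustration, and the whitespace tokenizer is a simplification of what `CountVectorizer` does that happens to agree on these three documents.

```python
import math

docs = ['Julie loves me more than Linda loves me',
        'Jane likes me more than Julie loves me',
        'He likes basketball more than baseball']

def smoothed_idf(word, tokenized):
    """ln((1 + n) / (1 + df)) + 1, the smoothed IDF described in scikit-learn's docs."""
    n = len(tokenized)
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

def tfidf_rows(docs):
    """Return (vocab, rows): the sorted vocabulary and L2-normalized TF-IDF rows."""
    # Lowercasing plus a whitespace split stands in for CountVectorizer's tokenizer.
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted(set(w for tokens in tokenized for w in tokens))
    rows = []
    for tokens in tokenized:
        weights = [tokens.count(w) * smoothed_idf(w, tokenized) for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
for word, weight in zip(vocab, rows[2]):   # third document
    print('%-10s %.8f' % (word, weight))
```

With smoothing, a word that occurs in every document (such as "than") keeps a positive weight, so no negative entries appear in the matrix.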




  [1] G. Salton, A. Wong, and C.-S. Yang, “A Vector Space Model for Automatic Indexing,” Communications of the ACM, vol. 18, no. 11, pp. 613–620, Nov. 1975.


  1. The code is mainly based on “The Vector Space Model of text”, although the calculations in that article contain errors.

  2. What is the base of $\log$? The code here uses `np.log`, which is the natural logarithm (base $e$).

