How do I normalize or weight a document-feature matrix by the length of dictionary entries?

by spindoctor   Last Updated October 11, 2018 15:19 PM

What is the best-practice way to normalize or weight a document-feature matrix (dfm) by the length of the dictionary entries used to build it? Here is some sample code. In reality, my application covers several issues (e.g. health care, ethics, jobs, debts and deficits), and to fully capture them, some dictionary keys have upwards of 10 entries while others have only 2 or 3.

Would it be best to divide each column of the dfm by the number of patterns in the corresponding dictionary key? (A rough sketch of that approach follows the example code below.)

The quanteda package does contain a few ways to weight a dfm (see ?dfm_weight), but I am not sure whether any of them is well suited to this problem, so this is as much a substantive question as a technical one.

library(quanteda)

# data_char_ukimmig2010 ships with quanteda: UK 2010 party manifesto extracts on immigration
text <- data_char_ukimmig2010
corp <- corpus(text)

mydict <- dictionary(
  list(
    jobs    = c('job*'),
    culture = c('culture*', 'relig*', 'identity', 'language*')
  )
)

toks <- tokens(corp, remove_punct = TRUE, remove_hyphens = TRUE)

# one column per dictionary key, counting pattern matches per document
mydfm <- dfm(toks, dictionary = mydict)

mydfm
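
To make the column-division idea concrete, here is a minimal sketch of what I have in mind. It assumes the dfm's features are exactly the dictionary keys (which is what dfm(..., dictionary = mydict) produces) and, if I read ?dfm_weight correctly, that its weights argument accepts a named numeric vector of feature-level multipliers; entry_lengths, w, and mydfm_norm are just illustrative names.

# number of patterns per dictionary key, e.g. jobs = 1, culture = 4
entry_lengths <- vapply(as.list(mydict), length, integer(1))

# align the multipliers with the dfm's feature order, then divide each column
w <- 1 / entry_lengths[featnames(mydfm)]
mydfm_norm <- dfm_weight(mydfm, weights = w)

mydfm_norm

# equivalent check on the dense representation
sweep(as.matrix(mydfm), 2, entry_lengths[featnames(mydfm)], "/")

Whether dividing by key length like this, or instead normalizing by document length (e.g. dividing each row by ntoken(toks)), or using one of the built-in dfm_weight() schemes is the right substantive choice is the part I am unsure about.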

