How do I normalize or weight document-feature-matrix by length of dictionary entries

by spindoctor   Last Updated October 11, 2018 15:19 PM

What is the best-practice way to normalize or weight a document-feature matrix by the length of its dictionary entries? Here is some sample code. In reality, my application covers several issues, e.g. health care, ethics, jobs, debts and deficits. But to fully capture the different issues, some dictionary keys have upwards of 10 entries, while others have only 2 or 3.

Would it be best to divide each column in the dfm by the number of entries in the respective dictionary key?

The quanteda library does contain a few ways to weight a dfm (see ?dfm_weight), but I am not sure whether any of those is well suited to this problem, so this question is as much substantive as technical.

    mydict <- dictionary(list(
      culture = c('culture*', 'relig*', 'identity', 'language*')
    ))

    toks <- tokens(corp, remove_punct = TRUE, remove_hyphens = TRUE)

    mydfm <- dfm(toks, dictionary = mydict)
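One way to implement the division suggested above is via the `weights` argument of `dfm_weight()`, which accepts a named numeric vector of per-feature weights. This is only a sketch under assumptions: the dictionary and toy texts below are illustrative, not the actual data from the question.

    library(quanteda)

    # Hypothetical dictionary mirroring the question: keys differ in length
    mydict <- dictionary(list(
      culture = c("culture*", "relig*", "identity", "language*"),  # 4 patterns
      economy = c("job*", "debt*", "deficit*")                     # 3 patterns
    ))

    # Toy corpus for illustration only
    toks <- tokens(c(d1 = "religion and identity shape culture",
                     d2 = "jobs debts and the deficit"),
                   remove_punct = TRUE)
    mydfm <- dfm(tokens_lookup(toks, mydict))

    # Divide each key's column by its number of patterns, so keys with
    # more entries are not mechanically inflated by their length
    mydfm_norm <- dfm_weight(mydfm, weights = 1 / lengths(mydict))

Because `lengths(mydict)` returns a named vector (`culture = 4`, `economy = 3`) whose names match the dfm's feature labels after dictionary lookup, the per-column division is a single line; whether length of the key is the right denominator (versus, say, total corpus frequency of each key's patterns) is the substantive judgment call the question raises.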

