What is the best-practice way to normalize or weight document-feature matrices by the length of dictionary entries? Here is some sample code. In reality, my application involves different issues, e.g. health care, ethics, jobs, debts and deficits, etc. To fully capture the different issues, some dictionaries have upwards of 10 entries, while others have only 2 or 3.
Would it be best to divide each column in the dfm by the length of the respective dictionary entry?
The quanteda library does contain a few ways to weight a dfm (e.g. ?dfm_weight), but I am not sure whether any of those is best suited to this problem, so this question is as much substantive as technical.
    library(quanteda)

    text <- data_char_ukimmig2010
    corp <- corpus(text)

    mydict <- dictionary(list(
      jobs = c("job*"),
      culture = c("culture*", "relig*", "identity", "language*")
    ))

    toks <- tokens(corp, remove_punct = TRUE, split_hyphens = TRUE)
    mydfm <- dfm(tokens_lookup(toks, dictionary = mydict))
    mydfm
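For what it's worth, one way to implement the per-entry-length division described above is via the `weights` argument of `dfm_weight()`, dividing each dictionary-key column by the number of patterns in that key. This is only a sketch, not necessarily the substantively "right" answer; it assumes a recent quanteda where dictionary keys can be coerced with `as.list()` and where `dfm_weight()` accepts a named numeric `weights` vector:

```r
library(quanteda)

toks <- tokens(corpus(data_char_ukimmig2010), remove_punct = TRUE)
mydict <- dictionary(list(
  jobs = c("job*"),
  culture = c("culture*", "relig*", "identity", "language*")
))
mydfm <- dfm(tokens_lookup(toks, dictionary = mydict))

# number of patterns per dictionary key: jobs = 1, culture = 4
entry_lengths <- lengths(as.list(mydict))

# divide each dictionary-key column by its entry length
mydfm_norm <- dfm_weight(mydfm, weights = 1 / entry_lengths)
mydfm_norm
```

Whether dividing by raw pattern count is the right normalization is still an open substantive question, since a single glob pattern like `job*` can match many more tokens than several narrow literal entries.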