Gensim LDA Topic-Term matrix all Zero

by Yuejiang_Li, last updated June 30, 2020 02:19 AM

I ran into a confusing problem when using gensim.models.ldamodel for topic modeling. I cleaned my document set and extracted the dictionary as suggested in the LDA tutorial. Next, I trained the LDA model with:

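from gensim.models import ldamodel

# corpus and dictionary come from the preprocessing step described above
lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=80)
for topic in lda.print_topics(num_topics=80, num_words=5):
    print(topic)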

The results look reasonable:

(0, '0.027*"预约" + 0.022*"航班" + 0.018*"购买" + 0.017*"乘客" + 0.015*"市民"')
(1, '0.013*"京东" + 0.012*"办理" + 0.011*"沈阳市" + 0.010*"严厉" + 0.009*"违法"')
(2, '0.022*"喇叭" + 0.020*"宣传" + 0.019*"防疫" + 0.014*"工作" + 0.012*"应急预案"')
...

The next step is to choose the number of topics using coherence and perplexity. When I set the number of topics to 160, the result is quite strange:

(95, '0.000*"隐形眼镜" + 0.000*"益菌" + 0.000*"退热药" + 0.000*"口腔温度" + 0.000*"餐巾纸"')
(78, '0.000*"隐形眼镜" + 0.000*"益菌" + 0.000*"退热药" + 0.000*"口腔温度" + 0.000*"餐巾纸"')
(18, '0.000*"隐形眼镜" + 0.000*"益菌" + 0.000*"退热药" + 0.000*"口腔温度" + 0.000*"餐巾纸"')
(8, '0.000*"隐形眼镜" + 0.000*"益菌" + 0.000*"退热药" + 0.000*"口腔温度" + 0.000*"餐巾纸"')
(157, '0.000*"隐形眼镜" + 0.000*"益菌" + 0.000*"退热药" + 0.000*"口腔温度" + 0.000*"餐巾纸"')
...

In fact, all of these topics are identical: the weight of every term in every topic is printed as 0.000. I do not know whether this is a bug or whether I am using the wrong training parameters.
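One check that might help narrow this down is whether the weights are really zero or only rounded to zero in the printout. The topics above are printed with three decimals, so any weight below 0.0005 shows as 0.000; a topic that is spread almost evenly over a large vocabulary would look exactly like this. The sketch below uses only the standard library and a synthetic matrix. The helper names (`format_weight`, `near_uniform_topics`), the 5,000-term vocabulary, and the 5% tolerance are all assumptions made for illustration, not part of gensim.

```python
def format_weight(p):
    # Assumption: mirrors the three-decimal formatting seen in the printed topics.
    return "%.3f" % p


def near_uniform_topics(topic_term, rel_tol=0.05):
    """Return the indices of rows whose entries all lie within rel_tol of 1/V.

    topic_term is any sequence of rows (one row per topic, one entry per term).
    """
    flagged = []
    for k, row in enumerate(topic_term):
        values = [float(x) for x in row]
        uniform = 1.0 / len(values)
        if all(abs(x - uniform) <= rel_tol * uniform for x in values):
            flagged.append(k)
    return flagged


# Synthetic stand-in for a topic-term matrix (hypothetical data).
V = 5000
flat = [1.0 / V] * V                          # topic that learned nothing
peaked = [0.5] + [0.5 / (V - 1)] * (V - 1)    # topic dominated by one term
matrix = [peaked, flat, flat]

print(format_weight(flat[0]))       # a weight of 1/5000 as it would be printed
print(near_uniform_topics(matrix))  # indices of the flat topics
```

With a trained model, the same check could be applied to the matrix returned by `lda.get_topics()` (available in recent gensim versions). If most of the 160 topics are flagged, the model has probably left those topics unused rather than producing true zeros, although that alone does not say whether the cause is the training parameters or the data.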


