This page presents two LDA theme analyses. The first was generated without an occurrence filter and the second with one. In our case, the first analysis tells us that the articles mostly deal with the epidemic or the coronavirus in China, as expected. In the second analysis, the themes are better defined and give us a better idea of what the articles cover besides the coronavirus epidemic. We must however keep in mind that the articles are first and foremost about the themes extracted in the first analysis, especially the general theme (0 in the theme selection).
Readers are strongly encouraged to use a Chinese-English dictionary browser extension (such as Zhongwen) if they need a translation. I have also translated the word lists into French.
The source code used to generate both LDA analyses is available on my GitHub. The whole process before that, downloading the articles and preparing them for the analysis, is also available.
LDA analysis allows us to distribute the 4881 articles of the nCovMemory database into different themes. Each theme is represented by a word list, and a word can appear in several lists. Both the proximity of the themes on the graph and their size matter. Hovering over a word shows in which theme that particular word is most influential. Other than that, the representations are quite easy to understand. Those who want to know more about them can see Chuang et al. and Sievert & Shirley, two academic articles describing this process.
Without occurrence filter (gensim filter_extremes)
With occurrence filter (gensim filter_extremes -> no_above=0.26)
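To give a rough idea of what the occurrence filter does, here is a minimal plain-Python sketch. It is not the code used for this analysis. It assumes the filter simply drops every word that appears in more than a fraction `no_above` of the documents, which is my understanding of the `no_above` argument of gensim's `filter_extremes`. The function name `filter_by_document_frequency` and the toy corpus are invented for the example, and the threshold is set to 0.5 rather than 0.26 only because the toy corpus is so small.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_above):
    """Drop tokens found in more than `no_above` (a fraction) of the documents."""
    # Document frequency: in how many documents each token occurs at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = {token for token, df in doc_freq.items() if df <= max_docs}
    # Rebuild each document with only the tokens that survived the filter.
    return [[token for token in doc if token in kept] for doc in docs]


# Toy corpus, invented for illustration (the real word lists are in Chinese).
# "epidemic" occurs in every document, so it dominates any theme extraction.
docs = [
    ["epidemic", "wuhan", "hospital"],
    ["epidemic", "masks", "supplies"],
    ["epidemic", "school", "online"],
    ["epidemic", "wuhan", "lockdown"],
]

filtered = filter_by_document_frequency(docs, no_above=0.5)
print(filtered)
```

In this toy run the ubiquitous word is removed while the rarer words remain, which is the same effect that lets the second analysis surface themes other than the epidemic itself.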