Indexing Based on Topic Modeling and MATHML for Building Vietnamese Technical Document Retrieval Effectively

Author(s):  
Tuan Cao Xuan ◽  
Linh Bui Khanh ◽  
Hung Vo Trung ◽  
Ha Nguyen Thi Thu ◽  
Tinh Dao Thanh

2020 ◽  
Author(s):  
Sunil Nagpal ◽  
Divyanshu Srivastava ◽  
Sharmila S. Mande

ABSTRACT: Topic modeling is frequently employed to discover structures (or patterns) in a corpus of documents. Its utility in text-mining and document retrieval tasks across many fields of scientific research is well known. Latent Dirichlet Allocation (LDA), an unsupervised machine learning approach, has in particular been used to identify latent (or hidden) topics in document collections and to determine the words that define one or more topics using a generative statistical model. Here we describe how SARS-CoV-2 genomic mutation profiles can be structured into a ‘Bag of Words’ to enable identification of signatures (topics) and their probabilistic distribution across genomes using LDA. Topic models were generated from ~47,000 novel coronavirus genomes (treated as documents), leading to the identification, through coherence optimization, of 16 amino acid mutation signatures and 18 nucleotide mutation signatures (equivalent to topics) in the corpus of chosen genomes. Treating genomes as documents also helped identify contextual nucleotide mutation signatures in the form of conventional N-grams (e.g. bi-grams and tri-grams). We validated the signatures obtained with the LDA-driven method against previously reported recurrent mutations and phylogenetic clades. Additionally, we report the geographical distribution of the identified mutation signatures in SARS-CoV-2 genomes on a global map. We therefore propose using non-phylogenetic, albeit classical, approaches such as topic modeling and other data-centric pattern mining algorithms to supplement efforts to understand the genomic diversity of evolving SARS-CoV-2 genomes (and other pathogens/microbes).
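
As a rough illustration of the ‘Bag of Words’ step described in the abstract, the sketch below turns per-genome mutation lists into a genome-by-term count matrix, for single mutations and for bi-grams. It is a minimal sketch under stated assumptions, not the authors' pipeline: the genome names, the toy mutation tokens, and the helper names `ngrams` and `bag_of_words` are all hypothetical, and the LDA fitting and coherence optimization are not shown.

```python
from collections import Counter

# Toy mutation profiles: one list of mutation tokens per genome ("document").
# Assumption: these tokens are chosen for illustration only and are not
# data from the study.
genomes = {
    "genome_A": ["C241T", "C3037T", "C14408T", "A23403G"],
    "genome_B": ["C241T", "C3037T", "A23403G"],
    "genome_C": ["G11083T", "C14408T"],
}


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(profiles, n=1):
    """Build a sorted vocabulary and a genome-by-term count matrix."""
    counts = {name: Counter(ngrams(toks, n)) for name, toks in profiles.items()}
    vocab = sorted({term for c in counts.values() for term in c})
    # One row per genome, one column per vocabulary term.
    matrix = [[counts[name][term] for term in vocab] for name in profiles]
    return vocab, matrix


# Unigram representation: each mutation is a "word".
vocab, matrix = bag_of_words(genomes)

# Bi-gram representation: adjacent mutation pairs become "words".
bigram_vocab, bigram_matrix = bag_of_words(genomes, n=2)
```

A count matrix of this shape is the usual input to an LDA implementation, where each row plays the role of a document and each fitted topic would correspond to a candidate mutation signature.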


Author(s):  
Maria A. Milkova

Information now accumulates so rapidly that the usual concept of iterative search requires revision. In a world oversaturated with information, covering and analyzing a problem comprehensively places high demands on search methods. An innovative approach to search should flexibly take into account the large amount of already accumulated knowledge and a priori requirements for the results. The results, in turn, should immediately provide a roadmap of the field under study, with as much detail as possible. Search based on topic modeling, so-called topic search, meets all of these requirements and thereby streamlines work with information, increases the efficiency of knowledge production, and helps avoid cognitive biases in the perception of information, which is important at both the micro and macro levels. To demonstrate topic search, the article considers the task of analyzing an import substitution program using patent data. The program includes plans for 22 industries and contains more than 1,500 products and technologies proposed for import substitution. Patent search based on topic modeling makes it possible to search directly by blocks of a priori information (the terms of the industrial import substitution plans) and to obtain a selection of relevant documents for each industry as output. This approach not only provides a comprehensive picture of the effectiveness of the program as a whole, but also gives a more detailed, visual view of which groups of products and technologies have been patented.
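
The idea of searching by whole blocks of a priori terms, one block per industry, can be sketched as follows. This is only an illustration under stated assumptions: plain cosine similarity over term counts stands in for the topic model used in the article, and the industry names, term blocks, patent tokens, and the function `search_by_blocks` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical term blocks, one per industry plan (not taken from the program).
industry_terms = {
    "machine_tools": ["lathe", "milling", "spindle"],
    "pharma": ["tablet", "vaccine", "compound"],
}

# Hypothetical patent abstracts, already tokenized.
patents = {
    "patent_1": ["milling", "spindle", "bearing", "spindle"],
    "patent_2": ["vaccine", "compound", "adjuvant"],
    "patent_3": ["lathe", "tablet"],
}


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_by_blocks(blocks, docs, top_k=2):
    """For each term block, return up to top_k matching documents, best first."""
    results = {}
    for industry, terms in blocks.items():
        query = Counter(terms)
        scored = [(cosine(query, Counter(toks)), name) for name, toks in docs.items()]
        # Drop documents that share no terms with the block.
        scored = [s for s in scored if s[0] > 0]
        # Highest score first; ties broken by document name.
        scored.sort(key=lambda s: (-s[0], s[1]))
        results[industry] = [name for _, name in scored[:top_k]]
    return results


results = search_by_blocks(industry_terms, patents)
```

In a topic-search setting the term counts would be replaced by topic distributions inferred for each block and each patent, but the overall shape (one query block in, one ranked selection per industry out) stays the same.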


2017 ◽  
Vol 18 (4) ◽  
pp. 683-711
Author(s):  
Hyun-Jeong Park ◽  
Hanna Kim ◽  
YuJung Hong

2020 ◽  
Vol 16 (2) ◽  
pp. 83-115
Author(s):  
Mira Kim ◽  
Hye Sun Hwang ◽  
Xu Li

2019 ◽  
Vol 58 (6) ◽  
pp. 197-207
Author(s):  
Juhae Baeck ◽  
Hyungil Kwon ◽  
Mihwa Choi ◽  
Yi-Hsiu Lin

2017 ◽  
Vol 17 (2) ◽  
pp. 19-29
Author(s):  
Mi-Ae Kim ◽  
Chang-Kyo Suh
