Similarity Detection Using Latent Semantic Analysis Algorithm

Similarity View is an application for visually comparing and exploring multiple models of text and collection of document. Friendbook finds ways of life of clients from client driven sensor information, measures the closeness of ways of life amongst clients, and prescribes companions to clients if their ways of life have high likeness. Roused by demonstrate a clients day by day life as life records, from their ways of life are separated by utilizing the Latent Dirichlet Allocation Algorithm. Manual techniques can't be utilized for checking research papers, as the doled out commentator may have lacking learning in the exploration disciplines. For different subjective views, causing possible misinterpretations. An urgent need for an effective and feasible approach to check the submitted research papers with support of automated software. A method like text mining method come to solve the problem of automatically checking the research papers semantically. The proposed method to finding the proper similarity of text from the collection of documents by using Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA) with synonym algorithm which is used to find synonyms of text index wise by using the English wordnet dictionary, another algorithm is LSA without synonym used to find the similarity of text based on index. LSA with synonym rate of accuracy is greater when the synonym are consider for matching.

Download Full-text

A comparative analysis of Latent Semantic analysis and Latent Dirichlet allocation topic modeling methods using Bible data

Indian Journal of Science and Technology ◽

10.17485/ijst/v13i44.1479 ◽

2020 ◽

Vol 13 (44) ◽

pp. 4474-4482

Author(s):

Vasantha Kumari Garbhapu ◽

Keyword(s):

Latent Semantic Analysis ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Word Association ◽

Machine Learning Algorithms ◽

Superior Performance ◽

Data Set ◽

Document Similarity ◽

Dirichlet Allocation

Objective: To compare the topic modeling techniques, as no free lunch theorem states that under a uniform distribution over search problems, all machine learning algorithms perform equally. Hence, here, we compare Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA) to identify better performer for English bible data set which has not been studied yet. Methods: This comparative study divided into three levels: In the first level, bible data was extracted from the sources and preprocessed to remove the words and characters which were not useful to obtain the semantic structures or necessary patterns to make the meaningful corpus. In the second level, the preprocessed data were converted into a bag of words and numerical statistic TF-IDF (Term Frequency – Inverse Document Frequency) is used to assess how relevant a word is to a document in a corpus. In the third level, Latent Semantic analysis and Latent Dirichlet Allocations methods were applied over the resultant corpus to study the feasibility of the techniques. Findings: Based on our evaluation, we observed that the LDA achieves 60 to 75% superior performance when compared to LSA using document similarity within-corpus, document similarity with the unseen document. Additionally, LDA showed better coherence score (0.58018) than LSA (0.50395). Moreover, when compared to any word within-corpus, the word association showed better results with LDA. Some words have homonyms based on the context; for example, in the bible; bear has a meaning of punishment and birth. In our study, LDA word association results are almost near to human word associations when compared to LSA. Novelty: LDA was found to be the computationally efficient and interpretable method in adopting the English Bible dataset of New International Version that was not yet created. Keywords: Topic modeling; LSA; LDA; word association; document similarity;Bible data set

Download Full-text

Topic modeling Twitter data using Latent Dirichlet Allocation and Latent Semantic Analysis

10.1063/1.5139825 ◽

2019 ◽

Author(s):

Siti Qomariyah ◽

Nur Iriawan ◽

Kartika Fithriasari

Keyword(s):

Latent Semantic Analysis ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Twitter Data ◽

Dirichlet Allocation

Download Full-text

Experimenting with Latent Semantic Analysis and Latent Dirichlet Allocation on Automated Essay Grading

2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS) ◽

10.1109/snams52053.2020.9336533 ◽

2020 ◽

Author(s):

Jalaa Hoblos

Keyword(s):

Latent Semantic Analysis ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Essay Grading ◽

Dirichlet Allocation

Download Full-text

Accuracy of Unit Under Test Identification Using Latent Semantic Analysis and Latent Dirichlet Allocation

2019 IEEE 15th International Scientific Conference on Informatics ◽

10.1109/informatics47936.2019.9119262 ◽

2019 ◽

Author(s):

Matej Madeja ◽

Jaroslav Poruban

Keyword(s):

Latent Semantic Analysis ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Dirichlet Allocation

Download Full-text

Thematic Context Derivator Algorithm for Enhanced Context Vector Machine: eCVM

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b4564.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 4872-4877

Keyword(s):

Language Processing ◽

Latent Semantic Analysis ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Named Entities ◽

Pagerank Algorithm ◽

Context Vector ◽

Improved Performance ◽

Evaluation Parameters ◽

Thematic Context

Natural Language Processing uses word embeddings to map words into vectors. Context vector is one of the techniques to map words into vectors. The context vector gives importance of terms in the document corpus. The derivation of context vector is done using various methods such as neural networks, latent semantic analysis, knowledge base methods etc. This paper proposes a novel system to devise an enhanced context vector machine called eCVM. eCVM is able to determine the context phrases and its importance. eCVM uses latent semantic analysis, existing context vector machine, dependency parsing, named entities, topics from latent dirichlet allocation and various forms of words like nouns, adjectives and verbs for building the context. eCVM uses context vector and Pagerank algorithm to find the importance of the term in document and is tested on BBC news dataset. Results of eCVM are compared with compared with the state of the art for context detrivation. The proposed system shows improved performance over existing systems for standard evaluation parameters.

Download Full-text

Methodologically grounded semantic analysis of large volume of chilean medical literature data applied to the analysis of medical research funding efficiency in Chile

Journal of Biomedical Semantics ◽

10.1186/s13326-020-00226-w ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Patricio Wolff ◽

Sebastián Ríos ◽

David Clavijo ◽

Manuel Graña ◽

Miguel Carrasco

Keyword(s):

Medical Research ◽

Large Volume ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Medical Literature ◽

National Level ◽

Scientific Research ◽

Source Analysis ◽

Research Papers ◽

The Impact

Abstract Background Medical knowledge is accumulated in scientific research papers along time. In order to exploit this knowledge by automated systems, there is a growing interest in developing text mining methodologies to extract, structure, and analyze in the shortest time possible the knowledge encoded in the large volume of medical literature. In this paper, we use the Latent Dirichlet Allocation approach to analyze the correlation between funding efforts and actually published research results in order to provide the policy makers with a systematic and rigorous tool to assess the efficiency of funding programs in the medical area. Results We have tested our methodology in the Revista Médica de Chile, years 2012-2015. 50 relevant semantic topics were identified within 643 medical scientific research papers. Relationships between the identified semantic topics were uncovered using visualization methods. We have also been able to analyze the funding patterns of scientific research underlying these publications. We found that only 29% of the publications declare funding sources, and we identified five topic clusters that concentrate 86% of the declared funds. Conclusions Our methodology allows analyzing and interpreting the current state of medical research at a national level. The funding source analysis may be useful at the policy making level in order to assess the impact of actual funding policies, and to design new policies.

Download Full-text

Semantic Pattern Detection in Covid-19 using Contextual Clustering and Intelligent Topic Modeling

International Journal of E-Health and Medical Communications ◽

10.4018/ijehmc.20220701oa07 ◽

2022 ◽

Vol 13 (2) ◽

pp. 0-0

Keyword(s):

Latent Semantic Analysis ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Pattern Detection ◽

Future Research ◽

Detection Approach ◽

Semantic Spaces ◽

And Control ◽

The Impact

The Covid-19 pandemic is the deadliest outbreak in our living memory. So, it is need of hour, to prepare the world with strategies to prevent and control the impact of the epidemics. In this paper, a novel semantic pattern detection approach in the Covid-19 literature using contextual clustering and intelligent topic modeling is presented. For contextual clustering, three level weights at term level, document level, and corpus level are used with latent semantic analysis. For intelligent topic modeling, semantic collocations using pointwise mutual information(PMI) and log frequency biased mutual dependency(LBMD) are selected and latent dirichlet allocation is applied. Contextual clustering with latent semantic analysis presents semantic spaces with high correlation in terms at corpus level. Through intelligent topic modeling, topics are improved in the form of lower perplexity and highly coherent. This research helps in finding the knowledge gap in the area of Covid-19 research and offered direction for future research.

Download Full-text

Issues and Methods for Access, Storage, and Analysis of Data From Online Social Communities

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch015 ◽

2018 ◽

pp. 402-432

Author(s):

Christopher John Quinn ◽

Matthew James Quinn ◽

Alan Olinsky ◽

John Thomas Quinn

Keyword(s):

Social Network ◽

Data Storage ◽

Latent Semantic Analysis ◽

Information Diffusion ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Online Social Network ◽

Network Models ◽

Probabilistic Latent Semantic Analysis ◽

User Interactions

This chapter provides an overview for a number of important issues related to studying user interactions in an online social network. The approach of social network analysis is detailed along with important basic concepts for network models. The different ways of indicating influence within a network are provided by describing various measures such as degree centrality, betweenness centrality and closeness centrality. Network structure as represented by cliques and components with measures of connectedness defined by clustering and reciprocity are also included. With the large volume of data associated with social networks, the significance of data storage and sampling are discussed. Since verbal communication is significant within networks, textual analysis is reviewed with respect to classification techniques such as sentiment analysis and with respect to topic modeling specifically latent semantic analysis, probabilistic latent semantic analysis, latent Dirichlet allocation and alternatives. Another important area that is provided in detail is information diffusion.

Download Full-text

Renormalization Analysis of Topic Models

Entropy ◽

10.3390/e22050556 ◽

2020 ◽

Vol 22 (5) ◽

pp. 556

Author(s):

Sergei Koltcov ◽

Vera Ignatenko

Keyword(s):

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Optimal Number ◽

Probabilistic Latent Semantic Analysis ◽

Model Parameters ◽

Grid Search ◽

Renormalization Procedure ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Dirichlet Allocation

In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of renormalization procedure with the Renyi entropy approach allows for quick searching of the optimal number of topics. In this paper, the renormalization procedure is developed for the probabilistic Latent Semantic Analysis (pLSA), and the Latent Dirichlet Allocation model with variational Expectation–Maximization algorithm (VLDA) and the Latent Dirichlet Allocation model with granulated Gibbs sampling procedure (GLDA). The experiments were conducted on two test datasets with a known number of topics in two different languages and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure allows for finding an approximation of the optimal number of topics at least 30 times faster than the grid search without significant loss of quality.

Download Full-text

Estimating Topic Modeling Performance with Sharma–Mittal Entropy

Entropy ◽

10.3390/e21070660 ◽

2019 ◽

Vol 21 (7) ◽

pp. 660 ◽

Cited By ~ 6

Author(s):

Sergei Koltcov ◽

Vera Ignatenko ◽

Olessia Koltsova

Keyword(s):

Latent Semantic Analysis ◽

Topic Modeling ◽

Statistical Physics ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Probabilistic Latent Semantic Analysis ◽

Model Parameters ◽

Theoretical Ground ◽

Text Documents ◽

Optimizing Model

Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to solve partially the problems of optimizing model parameters, simultaneously accounting for semantic stability. Our method is inspired by the concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models: probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which is able to account for just one of the parameters of our interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can be used to contribute to theory construction for machine learning, a rapidly-developing sphere that currently lacks a consistent theoretical ground.

Download Full-text