Term Categorization Using Latent Semantic Analysis for Intelligent Query Processing

With the rapid growth of social networks, a huge volume of short texts is generated within fractions of a second. Understanding and categorizing these texts for effective query processing is considered one of the vital challenges in the field of Natural Language Processing. The objective is to retrieve only relevant documents by categorizing the short texts. In the proposed method, terms are categorized by means of Latent Semantic Analysis (LSA). The novel method focuses on applying semantic enrichment to term categorization, with the aim of augmenting unstructured data items to achieve faster and more intelligent query processing in a big data environment. Document retrieval can therefore be made more effective through flexible query term mapping.
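
As a rough illustration of this kind of pipeline (not the authors' implementation; the toy corpus, the parameters and the use of scikit-learn are assumptions for the example), a term-document matrix can be factored with truncated SVD so that short queries are mapped into the same latent space and matched against documents by semantic similarity:

```python
# Minimal sketch: LSA over a toy corpus, then query-to-document matching
# in the latent space. All data and parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stock markets fall as investors react to the rate hike",
    "new smartphone release boosts tech company shares",
    "team wins championship after dramatic overtime finish",
    "central bank signals further interest rate increases",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)              # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)               # documents in the latent space
term_topics = svd.components_.T                 # terms in the latent space

# Map a short query into the same space and rank documents by similarity.
query_topics = svd.transform(vectorizer.transform(["interest rate decision"]))
scores = cosine_similarity(query_topics, doc_topics)[0]
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```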

This article examines the method of latent semantic analysis (LSA), its advantages, disadvantages, and the possibility of further adapting it for use on arrays of unstructured data, which make up most of the information that Internet users deal with. To extract context-dependent word meanings through the statistical processing of large sets of textual data, LSA operates on numeric word-by-text matrices, whose rows correspond to words and whose columns correspond to texts. Words are grouped into themes, and text units are represented in the theme space, by applying one of two matrix decompositions to the data matrix: singular value decomposition or non-negative matrix factorization. LSA studies have shown that the word and text similarities obtained in this way closely match human judgments. Based on the methods described above, the author has developed and proposed a new way of finding semantic links between unstructured data, namely posts on social networks. The method is based on latent semantic and frequency analyses and involves processing the retrieved search results, splitting each remaining text (post) into separate words, taking a window of n words to the right and left of each word, counting the number of occurrences of each term, and consulting a pre-built semantic resource (dictionary, ontology, RDF schema, ...). The developed method and algorithm have been tested on six well-known social networks, with which interaction occurs through the API of each network. The average score of the author's results exceeded that of each network's own search. The results obtained in the course of this dissertation can be used in the development of recommendation, search and other systems related to the search, categorization and filtering of information.
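
A compressed sketch of the counting step described above is given below, assuming a toy list of posts, a window of n = 2 words, and NumPy's SVD in place of a full LSA toolkit; the semantic resource (dictionary, ontology, RDF schema) is omitted:

```python
# Hedged sketch: split each post into words, count co-occurrences within a
# window of n words to the right and left, and factor the matrix with SVD
# (non-negative matrix factorization could be substituted). Toy data only.
import numpy as np

posts = [
    "concert tickets on sale for the summer music festival",
    "festival lineup announced with several rock bands",
    "city marathon route changed due to road works",
]
window = 2  # n words to the right and left

vocab = sorted({w for p in posts for w in p.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for post in posts:
    tokens = post.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[index[word], index[tokens[j]]] += 1

# Rows of U (scaled by the singular values) give word vectors in the theme space.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :2] * S[:2]
print(word_vectors[index["festival"]])
```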


PLoS ONE, 2017, Vol 12 (5), pp. e0177523
Author(s): Mingxi Zhang, Pohan Li, Wei Wang

Natural Language Processing uses word embeddings to map words into vectors. The context vector is one such technique; it gives the importance of terms in the document corpus. Context vectors can be derived using various methods such as neural networks, latent semantic analysis, and knowledge-base methods. This paper proposes a novel system, an enhanced context vector machine called eCVM, which is able to determine context phrases and their importance. eCVM uses latent semantic analysis, the existing context vector machine, dependency parsing, named entities, topics from Latent Dirichlet Allocation, and various forms of words such as nouns, adjectives and verbs for building the context. eCVM uses the context vector and the PageRank algorithm to find the importance of a term in a document and is tested on the BBC News dataset. Results of eCVM are compared with the state of the art for context derivation. The proposed system shows improved performance over existing systems on standard evaluation parameters.
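
One ingredient of such a pipeline, ranking terms via PageRank over a term co-occurrence graph, might be sketched as follows; the documents, the crude length-based stop-word filter and the use of networkx are assumptions for illustration, and the full eCVM additionally relies on LSA, dependency parsing, named entities and LDA topics:

```python
# Illustrative sketch: build a term co-occurrence graph per document and use
# PageRank scores as a proxy for term importance. Toy data, not the eCVM code.
import itertools
import networkx as nx

documents = [
    "the prime minister announced a new budget for public health",
    "public health spending rises in the new budget",
    "opposition criticises the budget announcement",
]

graph = nx.Graph()
for doc in documents:
    tokens = [t for t in doc.split() if len(t) > 3]   # crude stop-word filter
    for a, b in itertools.combinations(set(tokens), 2):
        weight = graph.get_edge_data(a, b, {"weight": 0})["weight"]
        graph.add_edge(a, b, weight=weight + 1)

scores = nx.pagerank(graph, weight="weight")
for term, score in sorted(scores.items(), key=lambda item: -item[1])[:5]:
    print(f"{score:.3f}  {term}")
```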


Reusing code with or without modification is a common process in building large codebases of system software such as Linux, gcc, and the JDK. This process is referred to as software cloning or forking. Developers often find it difficult to port bug fixes across a large code base from one language to another during software porting. Many approaches exist for identifying software clones within the same language, but these may not help developers involved in porting; hence there is a need for a cross-language clone detector. This paper uses a Natural Language Processing (NLP) approach based on latent semantic analysis, using singular value decomposition, to find cross-language clones in neighboring languages covering all four types of clones. It takes code (C, C++ or Java) as input and matches all the neighboring code clones in a static repository in terms of the frequency of matched lines.
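
A very rough sketch of measuring cross-language similarity with LSA is shown below; the snippets, the identifier-level tokenization and the scikit-learn calls are assumptions for illustration, and a real clone detector would normalise keywords and identifiers per language:

```python
# Hedged sketch: treat code snippets as documents, build a TF-IDF matrix over
# identifier/keyword tokens, reduce with truncated SVD (LSA) and compare
# snippets from different languages by cosine similarity. Toy snippets only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

snippets = [
    "int sum(int a, int b) { return a + b; }",                             # C
    "public static int sum(int a, int b) { return a + b; }",               # Java
    "void sort(std::vector<int>& v) { std::sort(v.begin(), v.end()); }",   # C++
]

vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
X = vectorizer.fit_transform(snippets)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

print(cosine_similarity(lsa[0:1], lsa[1:2]))   # C sum() vs Java sum()
print(cosine_similarity(lsa[0:1], lsa[2:3]))   # C sum() vs C++ sort()
```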


2017, Vol 16 (2), pp. 179-217
Author(s): Panagiotis Mazis, Andrianos Tsekrekos

Purpose: The purpose of this paper is to analyze the content of the statements that are released by the Federal Open Market Committee (FOMC) after its meetings, identify the main textual associative patterns in the statements and examine their impact on the US treasury market.

Design/methodology/approach: Latent semantic analysis (LSA), a language processing technique that allows recognition of the textual associative patterns in documents, is applied to all the statements released by the FOMC between 2003 and 2014, so as to identify the main textual "themes" used by the Committee in its communication to the public. The importance of the main identified "themes" is tracked over time, before examining their (collective and individual) effect on treasury market yield volatility via time-series regression analysis.

Findings: FOMC statements incorporate multiple, multifaceted and recurring textual themes, six of which characterize most of the communicated monetary policy in the sample period. The themes are statistically significant in explaining the variation in three-month, two-year, five-year and ten-year treasury yields, even after controlling for monetary policy uncertainty and the concurrent economic outlook.

Research limitations/implications: The main research implication of the study is that LSA can successfully identify the most economically significant themes underlying the Fed's communication, as the latter is expressed in monetary policy statements. The findings would be strengthened if the analysis were repeated using intra-day (tick-by-tick or five-minute) data on treasury yields.

Social implications: The findings are consistent with the notion that the move to "increased transparency" by the Fed is important and meaningful for financial and capital markets, as suggested by the significant effect that the most important identified textual themes have on treasury yield volatility.

Originality/value: This paper makes a timely contribution to a fairly recent stream of research that combines specific textual and statistical techniques to conduct content analysis. To the best of the authors' knowledge, this study is the first that applies LSA to the statements released by the FOMC.
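
Under strong simplifying assumptions, the theme-extraction step could look like the sketch below (toy stand-in statements, two themes instead of the paper's six, scikit-learn in place of whatever toolkit the authors used); the resulting statement-by-theme loadings are what would enter the time-series regressions:

```python
# Hedged sketch: extract recurring "themes" from policy statements with LSA.
# The statements are toy stand-ins, not actual FOMC text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

statements = [
    "inflation remains elevated and the committee raised the target rate",
    "labor market conditions improved while inflation moderated",
    "the committee will continue reducing its holdings of treasury securities",
    "economic activity expanded at a moderate pace amid tight credit conditions",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(statements)
svd = TruncatedSVD(n_components=2, random_state=0)
loadings = svd.fit_transform(X)   # statement-by-theme matrix (regressors)

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]
    print(f"theme {k}: {[terms[i] for i in top]}")
```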


2021
Author(s): João Alberto da Silva Amaral, Fernando Buarque Lima Neto

This paper proposes a semantic Natural Language Processing (NLP) approach to assist in the automated characterization of information relevant to compliance activities. In this context, the Latent Semantic Analysis (LSA) technique was used to assist in the dimensionality reduction process. The results were obtained by submitting two databases to the model: a database of audit reports issued by the State General Secretariat of Management (SCGE – Secretaria da Controladoria-Geral do Estado, in Portuguese) of Pernambuco between 2010 and 2019, and a base of appellate decisions issued by the Brazilian Federal Accountability Office (TCU – Tribunal de Contas da União, in Portuguese) in 2019. The performance of two dimensionality reduction methods was evaluated: TF-IDF and LSA. To validate the results, K-means was used as the clustering technique, and the silhouette coefficient was used to find the best number of clusters for a given data sample. LSA combined with K-means presented the best performance on both databases, achieving its strongest results on the TCU base of appellate decisions.
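
A minimal sketch of that pipeline follows, assuming placeholder texts in place of the audit reports and appellate decisions and scikit-learn implementations of TF-IDF, LSA, K-means and the silhouette coefficient:

```python
# Hedged sketch: TF-IDF, LSA for dimensionality reduction, K-means clustering,
# and the silhouette coefficient to choose the number of clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [
    "irregularity found in the procurement of medical supplies",
    "procurement contract awarded without public bidding",
    "payroll audit revealed duplicate salary payments",
    "duplicate payments identified in personnel expenses",
    "construction contract exceeded the approved budget",
    "budget overrun detected in road construction works",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)
X_lsa = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

best_k, best_score = None, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_lsa)
    score = silhouette_score(X_lsa, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"best k = {best_k}, silhouette = {best_score:.3f}")
```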


2020, pp. 1162-1177
Author(s): Eya Boukchina, Sehl Mellouli, Emna Menif

Citizens' participation is a form of democracy in which citizens are part of the decision-making process with regard to the development of their society. With the emergence of today's Information and Communication Technologies, citizens can participate in these processes by submitting input through digital media such as social media platforms or dedicated websites. Through these different channels, a large quantity of data, in different forms (text, image, video), can be generated. This data needs to be processed in order to extract valuable information that can be used by a city's decision-makers. This paper presents natural language processing techniques to extract valuable information from comments posted by citizens. It applies Latent Semantic Analysis to a corpus of citizens' comments to automatically identify the subjects that citizens raised.
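
As an illustration only (made-up comments, two topics, and gensim's LSI implementation of LSA as an assumed toolkit), surfacing subjects from a small set of citizen comments might look like:

```python
# Illustrative sketch: LSA (via gensim's LSI model) over citizen comments to
# surface the subjects they raise. Comments and topic count are made up.
from gensim import corpora, models

comments = [
    "the park near the school needs better lighting at night",
    "street lighting is broken on the main avenue",
    "please add more bike lanes downtown",
    "bike lanes on the bridge would reduce traffic",
]

tokenized = [comment.lower().split() for comment in comments]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
for topic_id, topic in lsi.print_topics(num_topics=2, num_words=4):
    print(topic_id, topic)
```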

