Latent Semantic Analysis and Beyond

Author(s):  
Anne Kao, Steve Poteet, Jason Wu, William Ferng, Rod Tjoelker, ...

Latent Semantic Analysis (LSA), known as Latent Semantic Indexing (LSI) when applied to information retrieval, has been a major analysis approach in text mining. It extends the vector space model of information retrieval: documents are still represented as numerical vectors, but a more sophisticated mathematical approach is used to characterize the essential features of the documents and to reduce the number of features in the search space. This chapter summarizes several major approaches to this dimensionality reduction, each with its own strengths and weaknesses, and describes recent breakthroughs and advances. It shows how the constructs and products of LSA applications can be made user-interpretable, and it reviews applications of LSA beyond information retrieval, in particular to text information visualization. While the major application of LSA is text mining, it is also highly applicable to cross-language information retrieval, Web mining, and the analysis of text transcribed from speech and of textual information in video.
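The dimensionality reduction at the core of LSA can be sketched as a truncated singular value decomposition of a term-document matrix. The terms and counts below are invented purely for illustration, not data from the chapter:

```python
import numpy as np

# Hypothetical toy term-document matrix: rows = terms, columns = documents.
# Documents 0 and 2 use aviation terms; documents 1 and 3 use legal terms.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "engine"
    [1.0, 0.0, 2.0, 0.0],   # "wing"
    [0.0, 3.0, 0.0, 1.0],   # "court"
    [0.0, 1.0, 0.0, 2.0],   # "judge"
])

# SVD: A = U @ diag(s) @ Vt; truncating to the k largest singular values
# keeps the essential features and discards the rest of the search space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional vector per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_same_topic = cosine(doc_vecs[0], doc_vecs[2])   # both aviation
sim_cross_topic = cosine(doc_vecs[0], doc_vecs[1])  # aviation vs. legal
```

In the reduced space the two aviation documents end up nearly parallel while the cross-topic pair is nearly orthogonal, which is the effect LSA exploits for retrieval.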


2012, pp. 174-190
Author(s):  
Michael W. Berry, Reed Esau, Bruce Kiefer

Electronic discovery (eDiscovery) is the process of collecting and analyzing electronic documents to determine their relevance to a legal matter. Advances in office technology have made documents ever easier to create, and the volume of data has outgrown the manual processes previously used to make relevance judgments. Methods from text mining and information retrieval have been applied in eDiscovery to help tame the volume of data, but the results have been uneven. This chapter looks at the historical bias of the collection process and examines how tools such as classifiers, latent semantic analysis, and non-negative matrix factorization deal with its nuances.


In this study, we propose an automatic single-document text summarization technique that combines Latent Semantic Analysis (LSA) with a diversity constraint. The technique uses query-based sentence ranking. Since we do not assume an information retrieval (IR) setting, we generate the query automatically using TF-IDF (Term Frequency-Inverse Document Frequency): the query vector is built from the terms with the highest IDF values. LSA applies vectorial semantics to analyze the relationships between documents in a corpus, or between sentences within a document and the key terms they carry, producing a set of concepts that link the documents and terms; in this way it represents the latent structure of the documents. Latent Semantic Indexing (LSI) is used to score the sentences of the document and rank them. Traditionally, the highest-scoring sentences are chosen for the summary; here we also compute the diversity between the chosen sentences, since a good summary should exhibit a maximum level of diversity, and produce the final summary accordingly. The proposed technique is evaluated on OpinosisDataset1.0.
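The query-generation and diversity-selection steps described above can be sketched as follows. The sentences, the five-term query size, and the 0.5 diversity threshold are all assumptions for illustration, and plain TF-IDF cosine similarity stands in for the LSI sentence scores:

```python
import math
from collections import Counter

# Hypothetical mini-document, one sentence per string.
sentences = [
    "the engine showed abnormal vibration during the test flight",
    "vibration in the engine was traced to a loose mounting bolt",
    "the crew reported clear weather for the entire flight",
    "a loose bolt in the engine mount caused the abnormal vibration",
]

N = len(sentences)
df = Counter(t for s in sentences for t in set(s.split()))
idf = {t: math.log(N / df[t]) for t in df}

# Generate the query from the highest-IDF terms (5 is an arbitrary choice).
query_terms = sorted(idf, key=idf.get, reverse=True)[:5]
query_vec = {t: idf[t] for t in query_terms}

def tfidf_vec(s):
    tf = Counter(s.split())
    return {t: tf[t] * idf[t] for t in tf}

def cos_sim(v1, v2):
    num = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return num / (n1 * n2) if n1 and n2 else 0.0

# Rank sentences against the query, then select greedily, skipping any
# sentence too similar to one already chosen (the diversity constraint).
ranked = sorted(range(N), reverse=True,
                key=lambda i: cos_sim(tfidf_vec(sentences[i]), query_vec))
summary, chosen = [], []
for i in ranked:
    v = tfidf_vec(sentences[i])
    if all(cos_sim(v, c) < 0.5 for c in chosen):
        summary.append(i)
        chosen.append(v)
    if len(summary) == 2:
        break
```

The greedy loop is one simple way to enforce diversity; the chapter's actual selection may differ.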


2014, Vol 4 (3), pp. 1-13
Author(s):  
Khadoudja Ghanem

In this paper, the authors propose a semantic approach to document categorization. The idea is to create a semantic index (a representative term vector) for each category by performing a local Latent Semantic Analysis (LSA) followed by a clustering step. A second, global application of LSA is then performed on a term-class matrix in order to retrieve the class most similar to the query (the document to classify), in the same way that LSA is used in information retrieval to retrieve the documents most similar to a query. The proposed system is evaluated on the popular 20 Newsgroups corpus. The results show the effectiveness of the method compared both with classic kNN and SVM classifiers and with methods reported in the literature: the new method achieves high precision and recall rates, and classification accuracy is significantly improved.
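The global-LSA classification step can be sketched over a toy term-class matrix. The terms, classes, and weights below are invented stand-ins for the semantic indexes that the paper's local LSA and clustering stage would produce:

```python
import numpy as np

terms = ["goal", "match", "stock", "market", "player", "trade"]
classes = ["sports", "finance"]

# Hypothetical term-class matrix: each column is a class's semantic index
# (a representative term vector), invented here for illustration.
A = np.array([
    [3.0, 0.0],   # goal
    [2.0, 0.0],   # match
    [0.0, 3.0],   # stock
    [0.0, 2.0],   # market
    [2.0, 0.0],   # player
    [0.5, 2.0],   # trade
])

# Global LSA: SVD of the term-class matrix, truncated to k dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
class_vecs = Vt[:k, :].T           # one latent-space vector per class

def classify(term_counts):
    # Fold the query document into the latent space (standard LSI fold-in),
    # then return the class with the highest cosine similarity.
    q_hat = np.diag(1.0 / s[:k]) @ U[:, :k].T @ term_counts
    sims = (class_vecs @ q_hat) / (
        np.linalg.norm(class_vecs, axis=1) * np.linalg.norm(q_hat))
    return classes[int(np.argmax(sims))]

# A document mentioning only "goal", "match", and "player".
pred = classify(np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0]))
```

This mirrors how LSI answers a retrieval query, except that the retrieved items are classes rather than documents.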


2012, Vol 12 (1), pp. 34-48
Author(s):  
Ch. Aswani Kumar, M. Radvansky, J. Annapurna

Latent Semantic Indexing (LSI), a variant of the classical Vector Space Model (VSM), is an Information Retrieval (IR) model that attempts to capture the latent semantic relationships between data items. Mathematical lattices, under the framework of Formal Concept Analysis (FCA), represent conceptual hierarchies in data and support information retrieval. Notably, both LSI and FCA operate on data represented as matrices. The objective of this paper is to systematically analyze VSM, LSI, and FCA for the task of IR using standard and real-life datasets.
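The FCA side of the comparison can be made concrete on a tiny formal context; the objects, attributes, and incidence relation below are assumptions for illustration. Each formal concept is a pair (extent, intent) in which the documents in the extent share exactly the terms in the intent:

```python
from itertools import combinations

# Hypothetical formal context: objects = documents, attributes = terms.
objects = ["d1", "d2", "d3"]
attributes = ["lsa", "svd", "lattice"]
incidence = {
    "d1": {"lsa", "svd"},
    "d2": {"lsa", "svd", "lattice"},
    "d3": {"lattice"},
}

def common_attrs(objs):
    """Attributes shared by every object in objs (the ' operator on objects)."""
    if not objs:
        return set(attributes)
    return set.intersection(*(incidence[o] for o in objs))

def common_objs(attrs):
    """Objects possessing every attribute in attrs (the ' operator on attributes)."""
    return {o for o in objects if attrs <= incidence[o]}

# Brute-force concept enumeration (fine for tiny contexts): close every
# object subset under the two derivation operators.
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(objects, r):
        intent = common_attrs(set(objs))
        extent = common_objs(intent)
        concepts.add((frozenset(extent), frozenset(intent)))
```

Ordering these concepts by inclusion of their extents yields the concept lattice that FCA uses as its conceptual hierarchy.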

