Evaluation of stop word lists in text retrieval using Latent Semantic Indexing

Author(s):  
A. N. K. Zaman ◽  
Pascal Matsakis ◽  
Charles Brown
1997 ◽  
Vol 3 (4) ◽  
pp. 261-287 ◽  
Author(s):  
Michael L. Best

I introduce a new alife model, an ecology based on a corpus of text, and apply it to the analysis of posts to USENET News. In this corporal ecology posts are organisms, the newsgroups of NetNews define an environment, and human posters situated in their wider context make up a scarce resource. I apply latent semantic indexing (LSI), a text retrieval method based on principal component analysis, to distill from the corpus those replicating units of text. LSI arrives at suitable replicators because it discovers word co-occurrences that segregate and recombine with appreciable frequency. I argue that natural selection is necessarily in operation because sufficient conditions for its occurrence are met: replication, mutagenicity, and trait/fitness covariance. I describe a set of experiments performed on a static corpus of over 10,000 posts. In these experiments I study average population fitness, a fundamental element of population ecology. My study of fitness arrives at the unhappy discovery that a flame-war, centered around an overly prolific poster, is the king of the jungle.
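To make the co-occurrence idea concrete, here is a minimal sketch of LSI on a toy term-document matrix. It is not the experimental setup described in the abstract: the five terms, the four documents, and the helper names (`truncated_svd`, `cosine`) are assumptions chosen for illustration, and the truncated SVD is computed with plain power iteration and deflation rather than a production solver.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def truncated_svd(A, k, iters=500):
    """Return up to k (sigma, u, v) triplets of A via power iteration with deflation."""
    A = [row[:] for row in A]  # work on a copy; the caller's matrix is untouched
    triplets = []
    for _ in range(k):
        At = [list(col) for col in zip(*A)]
        v = [1.0 / (j + 1) for j in range(len(A[0]))]  # deterministic start vector
        for _ in range(iters):
            w = matvec(At, matvec(A, v))  # one power step on A^T A
            n = norm(w)
            if n == 0.0:
                break
            v = [x / n for x in w]
        Av = matvec(A, v)
        sigma = norm(Av)
        if sigma == 0.0:
            break
        u = [x / sigma for x in Av]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before looking for the next one.
        for i, row in enumerate(A):
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    return triplets

# Toy term-document counts (rows = terms, columns = documents); illustrative only.
terms = ["car", "auto", "engine", "flower", "petal"]
A = [
    [1.0, 0.0, 0.0, 0.0],  # car
    [0.0, 1.0, 0.0, 0.0],  # auto
    [1.0, 1.0, 0.0, 0.0],  # engine
    [0.0, 0.0, 1.0, 1.0],  # flower
    [0.0, 0.0, 1.0, 1.0],  # petal
]

triplets = truncated_svd(A, k=2)
# Document j in the latent space: (sigma_1 * v_1[j], sigma_2 * v_2[j]).
doc_vectors = [[s * v[j] for s, _, v in triplets] for j in range(len(A[0]))]
raw_docs = [list(col) for col in zip(*A)]

raw_sim = cosine(raw_docs[0], raw_docs[1])              # "car" doc vs "auto" doc, raw counts
latent_sim = cosine(doc_vectors[0], doc_vectors[1])     # same pair in the 2-D latent space
unrelated_sim = cosine(doc_vectors[0], doc_vectors[2])  # "car" doc vs "flower" doc
```

In this toy case the first two documents share only "engine", so their raw cosine similarity is 0.5, while in the two-dimensional latent space they become nearly identical and stay orthogonal to the flower documents. That is the sense in which LSI groups words that co-occur with the same neighbours even when they never appear together.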


2008 ◽  
Vol 7 (1) ◽  
pp. 182-191 ◽  
Author(s):  
Sebastian Klie ◽  
Lennart Martens ◽  
Juan Antonio Vizcaíno ◽  
Richard Côté ◽  
Phil Jones ◽  
...  

2011 ◽  
Vol 181-182 ◽  
pp. 830-835
Author(s):  
Min Song Li

Latent Semantic Indexing (LSI) is an effective feature extraction method that can capture the underlying latent semantic structure between words in documents. However, using it to select a feature subspace is probably not the most appropriate choice for text categorization, since it orders the extracted features according to their variance, not their classification power. We propose a method based on support vector machines to extract features and select an LSI subspace suited to classification. Experimental results indicate that the method improves classification performance with a more compact representation.
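The gap between variance ordering and classification power can be illustrated with a small sketch. It does not reproduce the SVM-based selection the abstract describes: a simple Fisher-style separation score is swapped in as a stand-in, and the latent coordinates, labels, and the function name `fisher_score` are hypothetical values made up for illustration.

```python
from statistics import mean, pvariance

# Hypothetical latent coordinates (rows = documents, columns = LSI components),
# with columns already ordered by decreasing variance, as LSI delivers them.
docs = [
    [4.0, 0.9],
    [-4.0, 1.1],
    [3.5, -1.0],
    [-3.5, -0.9],
]
labels = [1, 1, 0, 0]  # hypothetical binary class labels

def fisher_score(values, labels):
    """Squared gap between class means divided by summed within-class variance."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    within = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / (within + 1e-12)

columns = list(zip(*docs))
component_ids = range(len(columns))

# Ranking used implicitly by LSI: largest variance first.
by_variance = sorted(component_ids, key=lambda j: pvariance(columns[j]), reverse=True)
# Ranking by class separation: most discriminative component first.
by_fisher = sorted(component_ids, key=lambda j: fisher_score(columns[j], labels), reverse=True)
```

Here component 0 carries most of the variance but has identical class means, so it separates nothing, while the low-variance component 1 splits the two classes cleanly. The two rankings therefore disagree, which is the situation that motivates choosing the subspace with a supervised criterion.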


2021 ◽  
Vol 12 (4) ◽  
pp. 169-185
Author(s):  
Saida Ishak Boushaki ◽  
Omar Bendjeghaba ◽  
Nadjet Kamel

Clustering is an important unsupervised analysis technique for big data mining. It is applied in several domains, including the biomedical documents of the MEDLINE database. Document clustering based on metaheuristics is an active research area. However, these algorithms tend to get trapped in local optima, require many parameters to be tuned, and need the documents to be indexed by a high-dimensional matrix under the traditional vector space model. To overcome these limitations, this paper proposes a new parameter-free document clustering algorithm (ASOS-LSI). It is based on the recent symbiotic organisms search (SOS) metaheuristic and enhanced by an acceleration technique. Furthermore, the documents are represented semantically using the well-known latent semantic indexing (LSI). Experiments conducted on well-known biomedical document datasets show the significant superiority of ASOS-LSI over five well-known algorithms in terms of compactness, F-measure, purity, misclassified documents, entropy, and runtime.
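Two of the evaluation measures named above, purity and entropy, can be sketched as follows. This uses one common definition of each (entropy here is the size-weighted average of per-cluster class entropy in bits, without further normalization); the abstract does not state which variant was used, and the cluster assignments and class labels are made up for illustration.

```python
import math
from collections import Counter

def class_counts_per_cluster(clusters, classes):
    """Map each cluster id to a Counter of the true classes it contains."""
    counts = {}
    for cluster_id, true_class in zip(clusters, classes):
        counts.setdefault(cluster_id, Counter())[true_class] += 1
    return counts

def purity(clusters, classes):
    """Fraction of documents that belong to the majority class of their cluster."""
    counts = class_counts_per_cluster(clusters, classes)
    return sum(max(c.values()) for c in counts.values()) / len(classes)

def entropy(clusters, classes):
    """Size-weighted average of per-cluster class entropy, in bits (lower is better)."""
    counts = class_counts_per_cluster(clusters, classes)
    total = len(classes)
    result = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((n / size) * math.log2(n / size) for n in c.values())
        result += (size / total) * h
    return result

# Hypothetical clustering of six documents into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]

p = purity(clusters, classes)
e = entropy(clusters, classes)
```

For this toy assignment, cluster 0 holds two documents of class "a" and one of class "b", and cluster 1 is pure, so purity is 5/6 and the weighted entropy is about 0.459 bits.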

