TEXT CLUSTERING IN CONCEPT BASED MINING

Author(s):  
PRADNYA S. RANDIVE ◽  
NITIN N. PISE

In text mining, most techniques depend on statistical analysis of terms. Statistical analysis captures important terms within the document only, whereas the concept-based mining model analyzes terms at the sentence, document, and corpus levels. The model consists of sentence-based concept analysis, document-based and corpus-based concept analysis, and a concept-based similarity measure. Experimental results show that the sentence, document, and corpus levels of concept analysis, as well as their combination, enhance text clustering quality.
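A minimal sketch of how sentence-, document-, and corpus-level concept statistics might feed a concept-based similarity measure is given below. The particular weighting (average conceptual term frequency times document frequency times an IDF factor) and the cosine-style combination are illustrative assumptions, not the exact formulas of the model.

```python
# Hedged sketch: combining sentence-, document-, and corpus-level concept
# statistics into a document similarity score. The weighting scheme below
# (avg ctf * tf * idf) is illustrative, not the paper's exact formula.
import math
from collections import Counter

def concept_weights(doc_sentences, corpus_doc_freq, num_docs):
    """doc_sentences: list of sentences, each a list of concept strings.
    corpus_doc_freq: concept -> number of documents containing it."""
    ctf = Counter()        # total occurrences, sentence level
    sent_hits = Counter()  # number of sentences mentioning the concept
    for sent in doc_sentences:
        for c, n in Counter(sent).items():
            ctf[c] += n
            sent_hits[c] += 1
    tf = Counter(c for sent in doc_sentences for c in sent)  # document level
    weights = {}
    for c in tf:
        avg_ctf = ctf[c] / sent_hits[c]                              # sentence level
        idf = math.log(num_docs / (1 + corpus_doc_freq.get(c, 0)))   # corpus level
        weights[c] = avg_ctf * tf[c] * idf
    return weights

def concept_similarity(w1, w2):
    """Cosine-style similarity over shared concepts."""
    shared = set(w1) & set(w2)
    num = sum(w1[c] * w2[c] for c in shared)
    norm = math.sqrt(sum(v * v for v in w1.values())) * \
           math.sqrt(sum(v * v for v in w2.values()))
    return num / norm if norm else 0.0
```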

Author(s):  
Shady Shehata ◽  
Fakhri Karray ◽  
Mohamed Kamel

Most text mining techniques are based on word and/or phrase analysis of the text. Statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in their documents while one term contributes more to the meaning of its sentences than the other. Thus, the underlying model should identify terms that capture the semantics of the text. In this case, the model can capture terms that present the concepts of the sentence, which leads to discovering the topic of the document. A new concept-based mining model is introduced that relies on the analysis of both the sentence and the document, rather than the traditional analysis of the document dataset only. The concept-based model can effectively discriminate between terms that are unimportant to the sentence semantics and terms that hold the concepts representing the sentence meaning. The proposed model consists of a concept-based statistical analyzer, a conceptual ontological graph representation, and a concept extractor. Each term that contributes to the sentence semantics is assigned two different weights, one by the concept-based statistical analyzer and one by the conceptual ontological graph representation. These two weights are combined into a new weight, and the concepts with the maximum combined weights are selected by the concept extractor. The concept-based model is used to enhance the quality of text clustering, categorization, and retrieval significantly.
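The following is a small illustrative sketch of the final stage described above: merging the two per-term weights and letting a concept extractor keep the highest-ranked terms. The combination rule (a simple sum) and the cutoff parameter top_k are assumptions made for illustration only.

```python
# Hedged sketch: merging the two per-term weights and keeping the top-ranked
# concepts. The combination rule (sum) and the cutoff are illustrative.
def extract_concepts(stat_weights, cog_weights, top_k=10):
    """stat_weights / cog_weights: term -> weight from the statistical
    analyzer and the conceptual ontological graph, respectively."""
    combined = {}
    for term in set(stat_weights) | set(cog_weights):
        combined[term] = stat_weights.get(term, 0.0) + cog_weights.get(term, 0.0)
    # keep the terms with the largest combined weight as the document's concepts
    return sorted(combined, key=combined.get, reverse=True)[:top_k]

# usage with toy weights
print(extract_concepts({"engine": 0.8, "fuel": 0.3}, {"engine": 0.5, "car": 0.6}))
```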


2010 ◽  
Vol 22 (10) ◽  
pp. 1360-1371 ◽  
Author(s):  
Shady Shehata ◽  
Fakhri Karray ◽  
Mohamed Kamel

2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Ali A. Amer ◽  
Hassan I. Abdalla

Abstract Similarity measures have long been utilized in the information retrieval and machine learning domains for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure has been recorded to be both highly effective and highly efficient. Thus, the quest for an efficient and effective similarity measure remains an open challenge. This study therefore introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a comprehensive examination of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation has been performed on two of the most popular datasets, namely Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
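As a rough illustration of the experimental setup, the sketch below builds binary BoW vectors, compares a generic set-theoretic measure (Jaccard, standing in for the STB-SM formula, which is not reproduced here) against cosine similarity, and classifies a query with a 1-nearest-neighbour rule.

```python
# Hedged sketch of the evaluation setup: BoW features, a set-based similarity
# (Jaccard as a generic stand-in; the paper's STB-SM formula is not reproduced)
# versus cosine, and a 1-nearest-neighbour classifier.
import numpy as np

def jaccard(a, b):
    """Set-theoretic similarity over binary term-presence vectors."""
    inter = np.sum(np.logical_and(a, b))
    union = np.sum(np.logical_or(a, b))
    return inter / union if union else 0.0

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def knn_predict(x, train_X, train_y, sim):
    """Label of the most similar training document (k = 1)."""
    scores = [sim(x, t) for t in train_X]
    return train_y[int(np.argmax(scores))]

# toy usage: three training docs over a 5-term vocabulary
train_X = np.array([[1, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 1]])
train_y = ["sports", "sports", "finance"]
query = np.array([1, 0, 0, 1, 1])
print(knn_predict(query, train_X, train_y, jaccard))
print(knn_predict(query, train_X, train_y, cosine))
```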


2018 ◽  
Vol 7 (2.18) ◽  
pp. 102
Author(s):  
Harsha Patil ◽  
Ramjeevan Singh Thakur

Document clustering is an unsupervised method for grouping documents into clusters on the basis of their similarity. A document is placed in a specific cluster on the basis of a membership score, which is calculated through a membership function. However, many traditional clustering algorithms are based only on BOW (Bag of Words), which ignores the semantic similarity between a document and a cluster. In this research we consider the semantic association between a cluster and a text document when calculating the membership score of a document for a specific cluster. Several researchers are working on semantic aspects of document clustering to improve clustering performance, and external knowledge bases such as WordNet, Wikipedia, and Lucene are utilized for this purpose. The proposed approach exploits WordNet to improve the cluster membership function. The experimental results show that clustering quality improves significantly when the proposed semantic framework is used.
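A minimal sketch of how a WordNet measure could contribute a semantic term to the membership score is shown below. It uses NLTK's WordNet interface and Wu-Palmer similarity as a representative measure; the averaging scheme is an assumption rather than the paper's exact membership function.

```python
# Hedged sketch: scoring a document's semantic affinity to a cluster via
# WordNet similarity between document terms and cluster keywords. Wu-Palmer
# similarity is used as a representative WordNet measure; the paper's exact
# membership function is not reproduced.
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def wordnet_sim(term1, term2):
    """Wu-Palmer similarity between the first synsets of two terms."""
    s1, s2 = wn.synsets(term1), wn.synsets(term2)
    if not s1 or not s2:
        return 0.0
    score = s1[0].wup_similarity(s2[0])
    return score or 0.0   # wup_similarity can return None

def semantic_membership(doc_terms, cluster_keywords):
    """Average best-match similarity of document terms to cluster keywords."""
    if not doc_terms or not cluster_keywords:
        return 0.0
    total = sum(max(wordnet_sim(t, k) for k in cluster_keywords) for t in doc_terms)
    return total / len(doc_terms)

# usage
print(semantic_membership(["car", "engine"], ["automobile", "vehicle"]))
```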


2012 ◽  
Vol 532-533 ◽  
pp. 1716-1720 ◽  
Author(s):  
Chun Xia Jin ◽  
Hai Yan Zhou ◽  
Qiu Chan Bai

To address the problems of sparse keywords and similarity drift in short text segments, this paper proposes a short text clustering algorithm with feature keyword expansion (STCAFKE). The method expands feature keywords based on HowNet and combines the K-means algorithm with a density-based algorithm to cluster short texts. Feature keyword expansion increases the number of keywords in a text and enriches its semantic features, enabling effective short text clustering. Experimental results show that the algorithm improves short text clustering quality in terms of precision and recall.
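The sketch below illustrates the general idea of keyword expansion followed by K-means clustering. The related_terms dictionary is a hypothetical stand-in for a HowNet lookup, and the density-based component of STCAFKE is omitted; only plain K-means over TF-IDF features of the expanded texts is shown.

```python
# Hedged sketch: expanding short texts with semantically related keywords
# before K-means clustering. `related_terms` is a placeholder standing in for
# a HowNet lookup; the paper's density-based seeding step is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

related_terms = {          # hypothetical HowNet-style expansions
    "laptop": ["computer", "notebook"],
    "phone": ["mobile", "telephone"],
    "loan": ["credit", "finance"],
}

def expand(text):
    """Append related keywords to a sparse short text."""
    words = text.split()
    extra = [t for w in words for t in related_terms.get(w, [])]
    return " ".join(words + extra)

docs = ["cheap laptop deals", "phone battery issue", "bank loan rates"]
expanded = [expand(d) for d in docs]

X = TfidfVectorizer().fit_transform(expanded)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```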


2016 ◽  
Vol 2016 ◽  
pp. 1-11 ◽  
Author(s):  
Ravindra E. Chaudhari ◽  
Sanjay B. Dhok

A fast normalized-covariance-based similarity measure for fractal video compression with quadtree partitioning is proposed in this paper. To increase the speed of fractal encoding, a simplified expression for the covariance between range and overlapped domain blocks within a search window is implemented in the frequency domain. All covariance coefficients are normalized by the standard deviation of the overlapped domain blocks, and these are efficiently calculated in one computation using two different approaches, namely FFT based and sum table based. The results of the two approaches are compared and are almost identical in all respects except memory requirements. Based on the proposed simplified similarity measure, the gray-level transformation parameters are computed in a modified form, and isometry transformations are performed using the rotation/reflection properties of the IFFT. Quadtree decomposition is used to partition the larger range blocks, that is, 16 × 16, based on a target level of motion-compensated prediction error. Experimental results show that the proposed method increases the encoding speed and compression ratio by 66.49% and 9.58%, respectively, compared to the NHEXS method, with an increase in PSNR of 0.41 dB. Compared to H.264, the proposed method saves 20% of the compression time with marginal variation in PSNR and compression ratio.
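The sketch below illustrates the core computation: a normalized covariance (normalized cross-correlation) map between a range block and every overlapped domain position in a search window, with the correlation term obtained via FFT and the local means/variances via sum tables. It follows the classic fast normalized cross-correlation idea; the paper's specific simplifications, isometry handling, and quadtree logic are not reproduced.

```python
# Hedged sketch: normalized covariance between a range block and every
# overlapped domain position in a search window. The correlation term is
# computed via FFT; local sums and sums of squares come from sum tables
# (integral images). Classic fast normalized cross-correlation, not the
# paper's exact simplified expression.
import numpy as np

def normalized_covariance_map(search_window, range_block):
    H, W = search_window.shape
    h, w = range_block.shape
    n = h * w
    r = range_block - range_block.mean()        # zero-mean range block

    # correlation of the window with the zero-mean block via FFT
    F = np.fft.rfft2(search_window, s=(H, W))
    K = np.fft.rfft2(r[::-1, ::-1], s=(H, W))
    corr = np.fft.irfft2(F * K, s=(H, W))[h - 1:H, w - 1:W]   # valid region

    # sum tables for local sums and sums of squares of the window
    s1 = np.pad(search_window, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    s2 = np.pad(search_window ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    def box(tbl):
        return tbl[h:, w:] - tbl[:-h, w:] - tbl[h:, :-w] + tbl[:-h, :-w]
    local_sum, local_sq = box(s1), box(s2)

    local_var = local_sq - local_sum ** 2 / n   # n * variance per position
    denom = np.sqrt(np.maximum(local_var, 1e-12)) * np.sqrt((r ** 2).sum())
    return corr / denom                         # one score per domain position

# usage: locate a 4x4 range block inside a 16x16 search window
rng = np.random.default_rng(0)
win = rng.random((16, 16))
blk = win[5:9, 3:7].copy()
scores = normalized_covariance_map(win, blk)
print(np.unravel_index(scores.argmax(), scores.shape))   # expected (5, 3)
```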

