Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection

Author(s):  
Nilupulee Nathawitharana ◽  
Damminda Alahakoon ◽  
Sumith Matharage

Humans are used to expressing themselves in written language, and language provides a medium in which we can describe our experiences in detail and with individuality. Although documents are a rich source of information, identifying, extracting, summarizing and searching them becomes very difficult when large volumes of documents accumulate, especially over time. Document clustering is a widely used technique for grouping documents based on similarity of content as represented by the words used, and once key groups are identified, hierarchical clustering facilitates further drill-down into sub-groupings. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and measures such as cluster accuracy and purity exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms that can recur across documents associated with different topics. Text data therefore cannot be treated as a direct ‘coding’ of a particular experience or situation, in contrast to numerical and categorical data, and term overlap is a very common characteristic of text clustering. In this paper we propose a new technique and methodology for capturing term overlap in text documents, highlight the different situations such overlap can signify, and discuss why this understanding is important for obtaining value from text clustering. Experiments conducted on a widely used text document collection show that the proposed methodology allows the term diversity of a document collection to be explored and clusters with minimal term overlap to be obtained.
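As a hedged illustration of the idea, the sketch below clusters a toy corpus hierarchically and scores the term overlap between the resulting clusters with a simple Jaccard measure over each cluster's strongest TF-IDF terms; the corpus, the top-k cut-off and the Jaccard score are assumptions for illustration, not the measure proposed in the paper.

```python
# Minimal sketch: quantify term overlap between hierarchical text clusters.
# The Jaccard-style overlap score is illustrative, not the paper's exact measure.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = ["stocks fell on weak earnings", "the match ended in a draw",
        "shares rallied after strong earnings", "the team won the final match"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
terms = np.array(vec.get_feature_names_out())

def top_terms(cluster_id, k=3):
    # average TF-IDF weight of each term within the cluster, keep the k strongest
    weights = X[labels == cluster_id].mean(axis=0)
    return set(terms[np.argsort(weights)[::-1][:k]])

a, b = top_terms(0), top_terms(1)
overlap = len(a & b) / len(a | b)   # Jaccard overlap of the clusters' key terms
print(f"cluster 0 terms: {a}\ncluster 1 terms: {b}\noverlap: {overlap:.2f}")
```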

Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Considering the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of knowledge becomes highly complex because of its sheer size. Text clustering is a common optimization problem used to organize a large amount of text information into subsets of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, to solve the text document clustering problem by modeling the β-hill climbing technique to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique; it was introduced to balance local and global search. Local search methods, such as the k-medoid and k-means techniques, have been applied successfully to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics, taken from the Laboratory of Computational Intelligence (LABIC). The results show that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves the performance of text clustering.
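A minimal sketch of the β-hill climbing idea applied to a cluster-assignment vector is given below; the cost function (distance of documents to their cluster mean), the single-reassignment neighbourhood move, and all parameter values are simplifying assumptions rather than the authors' exact formulation.

```python
# Illustrative sketch of β-hill climbing on a cluster-assignment vector.
# The objective (sum of distances to cluster means) and the neighbourhood move
# are simplified stand-ins for the paper's formulation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 5))          # 30 "documents" as TF-IDF-like vectors
K, BETA, ITERS = 3, 0.05, 500    # BETA controls random resets (exploration)

def cost(assign):
    total = 0.0
    for k in range(K):
        members = X[assign == k]
        if len(members):
            total += np.linalg.norm(members - members.mean(axis=0), axis=1).sum()
    return total

assign = rng.integers(0, K, len(X))
best = cost(assign)
for _ in range(ITERS):
    cand = assign.copy()
    cand[rng.integers(len(X))] = rng.integers(K)   # local move: reassign one document
    reset = rng.random(len(X)) < BETA              # β operator: random resets
    cand[reset] = rng.integers(0, K, reset.sum())
    c = cost(cand)
    if c < best:                                   # greedy acceptance
        assign, best = cand, c
print("final cost:", round(best, 3))
```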


2020 ◽  
Vol 25 (6) ◽  
pp. 755-769
Author(s):  
Noorullah R. Mohammed ◽  
Moulana Mohammed

Text data clustering organizes a set of text documents into a desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model the tweets. In existing topic models, the clustering tendency of tweets is initially assessed using Euclidean dissimilarity features, yet the cosine metric is more suitable for an informative assessment, especially in text clustering. This paper therefore develops a novel cosine-based internal and external validity assessment of cluster tendency to improve the computational efficiency of tweet data clustering. In the experiments, tweet clustering results are evaluated using cluster validity indices, and the cosine-based internal and external validity metrics are shown to outperform the alternatives on benchmark and Twitter-based datasets.
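The following sketch illustrates the kind of comparison involved: the same tweet-like corpus is clustered once and its internal validity is scored with a Euclidean and a cosine silhouette index; the silhouette index and the toy tweets are stand-ins for the specific validity measures and datasets used in the paper.

```python
# Minimal sketch: compare Euclidean- vs cosine-based internal validity of
# tweet-like document clusters; the silhouette index stands in for the
# paper's specific validity measures.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

tweets = ["rain again today", "heavy rain and wind expected",
          "great win for the home team", "the team plays the final tonight"]
X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("euclidean silhouette:", silhouette_score(X, labels, metric="euclidean"))
print("cosine silhouette:   ", silhouette_score(X, labels, metric="cosine"))
```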


Author(s):  
Junzo Watada ◽  
Keisuke Aoki ◽  
Masahiro Kawano ◽  
Muhammad Suzuri Hitam ◽  
...  

The availability of multimedia text document information has spread text mining among researchers. Text documents integrate numerical and linguistic data, which makes text mining interesting and challenging. We propose text mining based on a fuzzy quantification model and a fuzzy thesaurus. In this text mining we focus on: 1) sentences in Japanese text that are broken down into words; 2) a fuzzy thesaurus for finding words that match keywords in the text; and 3) fuzzy multivariate analysis for analyzing semantic meaning in predefined case studies. We use a fuzzy thesaurus to translate words written in Chinese and Japanese characters into keywords, which speeds up processing without requiring a dictionary to separate words. Fuzzy multivariate analysis is then used to analyze the processed data and to extract latent, mutually related structures in the text, i.e., otherwise obscured knowledge. We apply dual scaling to mining library and Web page text information, and we propose integrating the results into Kansei engineering for possible application in sales, marketing, and production.
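As a toy illustration of the fuzzy-thesaurus step, the sketch below maps surface words to keywords with membership degrees and combines them with a fuzzy OR (maximum); the thesaurus entries and degrees are invented for the example.

```python
# Toy sketch of a fuzzy thesaurus: each surface word maps to keywords with a
# membership degree, and a document's keyword profile is the max membership
# over its words. The thesaurus entries here are invented for illustration.
fuzzy_thesaurus = {
    "novel":   {"literature": 0.9, "innovation": 0.4},
    "story":   {"literature": 0.8},
    "library": {"literature": 0.6, "facility": 0.7},
}

def keyword_profile(words):
    profile = {}
    for w in words:
        for kw, degree in fuzzy_thesaurus.get(w, {}).items():
            profile[kw] = max(profile.get(kw, 0.0), degree)   # fuzzy OR (max)
    return profile

print(keyword_profile(["novel", "library"]))
# {'literature': 0.9, 'innovation': 0.4, 'facility': 0.7}
```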


Algorithms ◽  
2020 ◽  
Vol 13 (12) ◽  
pp. 345
Author(s):  
Laith Abualigah ◽  
Amir H. Gandomi ◽  
Mohamed Abd Elaziz ◽  
Abdelazim G. Hussien ◽  
Ahmad M. Khasawneh ◽  
...  

Text clustering is one of the efficient unsupervised learning techniques used to partition a huge number of text documents into a set of clusters, where each cluster contains similar documents and different clusters contain dissimilar documents. Nature-inspired optimization algorithms have been used successfully to solve various optimization problems, including text document clustering. In this paper, a comprehensive review is presented of the nature-inspired algorithms most relevant to the text clustering problem. Moreover, comprehensive experiments are conducted and analyzed to show the performance of the common, well-known nature-inspired optimization algorithms on text document clustering, including the Harmony Search (HS) Algorithm, Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm, Ant Colony Optimization (ACO), Krill Herd Algorithm (KHA), Cuckoo Search (CS) Algorithm, Gray Wolf Optimizer (GWO), and Bat-inspired Algorithm (BA). Seven benchmark text datasets are used to validate the performance of the tested algorithms. The results show that the performance of the well-known nature-inspired optimization algorithms is almost the same, with only slight differences. For improvement purposes, new modified versions of the tested algorithms can be proposed and tested to tackle text clustering problems.
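To make the shared experimental pattern concrete, the sketch below encodes a PSO particle as a set of K centroids in TF-IDF space and uses the summed cosine distance to the nearest centroid as fitness; the toy corpus, parameter values, and update-rule details are illustrative assumptions, not the settings used in the reviewed experiments.

```python
# Illustrative PSO-style setup for centroid-based text clustering: a particle
# encodes K centroids in TF-IDF space and fitness is the summed cosine distance
# of documents to their nearest centroid. Parameters are arbitrary for the sketch.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

docs = ["oil prices rise", "crude oil output falls",
        "new phone released", "phone sales grow fast"]
X = TfidfVectorizer().fit_transform(docs).toarray()
K, SWARM, ITERS, W, C1, C2 = 2, 10, 50, 0.7, 1.5, 1.5

rng = np.random.default_rng(1)
pos = rng.random((SWARM, K, X.shape[1]))           # particle positions (centroids)
vel = np.zeros_like(pos)

def fitness(cent):
    return pairwise_distances(X, cent, metric="cosine").min(axis=1).sum()

pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()
for _ in range(ITERS):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = W * vel + C1 * r1 * (pbest - pos) + C2 * r2 * (gbest - pos)
    pos += vel
    f = np.array([fitness(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()
print("best fitness:", round(pbest_f.min(), 3))
```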


Author(s):  
Chiheb-Eddine Ben N'Cir ◽  
Nadia Essoussi

Grouping documents based on their textual content is an important application of clustering referred to as text clustering. This paper addresses two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. A text document can discuss several topics and should then belong to several groups, so the learning algorithm must be able to produce non-disjoint clusters and assign documents to several clusters. Because text documents are unstructured data, applying a learning algorithm requires preparing the documents for numerical analysis, typically via the vector space model (VSM). This representation of text ignores correlation between terms and gives no importance to the order of words in the text. We therefore present an unsupervised learning method, based on the word sequence kernel, in which the correlation between adjacent words and the possibility of a document belonging to more than one cluster are not ignored. In addition, to facilitate the use of this method in text-analytic practice, we present the publicly available "DocCO" software. Experiments performed on several text collections show that the proposed method outperforms existing overlapping methods that use the VSM representation in terms of clustering accuracy.
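A simplified sketch of a gap-weighted word-sequence kernel, restricted to subsequences of two words, is shown below; the decay constant and the normalisation are illustrative choices, and the code is not the kernel implementation shipped with DocCO.

```python
# Simplified sketch of a gap-weighted word-sequence kernel restricted to
# word pairs (subsequences of length 2); the decay LAMBDA and normalisation
# are illustrative choices, not the paper's exact kernel.
import math

LAMBDA = 0.5   # penalises gaps between the two words of a matching pair

def ws_kernel(s, t):
    s, t = s.lower().split(), t.lower().split()
    k = 0.0
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            for a in range(len(t)):
                for b in range(a + 1, len(t)):
                    if s[i] == t[a] and s[j] == t[b]:
                        k += LAMBDA ** ((j - i) + (b - a))
    return k

def normalised(s, t):
    return ws_kernel(s, t) / math.sqrt(ws_kernel(s, s) * ws_kernel(t, t))

print(normalised("the cat sat on the mat", "the cat lay on the mat"))
```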


Author(s):  
Ch. Vidyadhari ◽  
N. Sandhya ◽  
P. Premchand

Text mining refers to the process of extracting high-quality information from text. It is broadly used in applications such as text clustering, text categorization, and text classification. Recently, text clustering has become a useful yet challenging way to group text documents, but irrelevant terms and high dimensionality reduce its accuracy. In this paper, semantic word processing and a novel Particle Grey Wolf Optimizer (PGWO) are proposed for automatic text clustering. Initially, the text documents are given as input to a pre-processing step that provides the useful keywords for feature extraction and clustering. The resulting keywords are then passed to the WordNet ontology to find the synonyms and hyponyms of every keyword. Subsequently, the frequency of every keyword is determined and used to build the text feature library. Since the text feature library has a large dimension, entropy is utilized to select the most significant features. Finally, the new Particle Grey Wolf Optimizer (PGWO) algorithm is developed by integrating particle swarm optimization (PSO) into the grey wolf optimizer (GWO). The proposed algorithm assigns class labels to generate the different clusters of text documents. Simulations are performed to analyze the performance of the proposed algorithm and to compare it with existing algorithms. The proposed method attains a clustering accuracy of 80.36% on the 20 Newsgroups dataset and 79.63% on Reuters, which confirms better automatic text clustering.
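The entropy-based selection step can be pictured with the small sketch below, which scores each term by the entropy of its count distribution over documents and keeps the more concentrated terms; the corpus, the median threshold, and the direction of the filter are assumptions made for illustration.

```python
# Illustrative entropy-based term filtering: a term whose counts are spread
# evenly over all documents (high entropy) is treated as less discriminative.
# The threshold and the direction of the filter are assumptions for this sketch.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["grey wolf optimizer for clustering", "particle swarm optimizer",
        "clustering of text documents", "text documents and wordnet synonyms"]
vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray().astype(float)

p = counts / counts.sum(axis=0)                  # per-term distribution over documents
with np.errstate(divide="ignore", invalid="ignore"):
    entropy = -np.nansum(np.where(p > 0, p * np.log2(p), 0.0), axis=0)

keep = entropy < np.median(entropy)              # keep the more concentrated terms
selected = np.array(vec.get_feature_names_out())[keep]
print("selected terms:", selected)
```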


Author(s):  
Fika Hastarita Rachman ◽  
Riyanarto Sarno ◽  
Chastine Fatichah

Music has lyrics and audio, and both components can serve as features for music emotion classification. Lyric features are extracted from text data and audio features are extracted from the audio signal. For emotion classification, an emotion corpus is required for lyric feature extraction; Corpus Based Emotion (CBE) has been shown to increase the F-measure of emotion classification on text documents. A music document has an unstructured format compared with an article text document, so it requires good preprocessing and conversion before classification. We used the MIREX dataset for this research. Psycholinguistic and stylistic features were used as lyric features: the psycholinguistic features relate to categories of emotion, and CBE is used here to support their extraction, while the stylistic features relate to the use of words unique to lyrics, e.g. ‘ooh’, ‘ah’, ‘yeah’, etc. Energy, temporal, and spectral features were extracted as audio features. The best result for music emotion classification was obtained by applying the Random Forest method to the combined lyric and audio features, with an F-measure of 56.8%.
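A hedged sketch of the fused-feature classification step follows: lyric-derived and audio-derived feature vectors are concatenated and passed to a Random Forest, and a macro F-measure is reported; the feature values and labels are random placeholders rather than real MIREX data.

```python
# Minimal sketch of the fused-feature idea: lyric features and audio features
# are concatenated and fed to a Random Forest; the feature values here are
# random placeholders, not real MIREX data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
lyric_feats = rng.random((200, 12))   # e.g. psycholinguistic + stylistic scores
audio_feats = rng.random((200, 20))   # e.g. energy, temporal, spectral descriptors
X = np.hstack([lyric_feats, audio_feats])
y = rng.integers(0, 4, 200)           # four emotion classes (placeholder labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("macro F-measure:", round(f1_score(y_te, clf.predict(X_te), average="macro"), 3))
```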


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1929
Author(s):  
Timea Bezdan ◽  
Catalin Stoean ◽  
Ahmed Al Naamany ◽  
Nebojsa Bacanin ◽  
Tarik A. Rashid ◽  
...  

The fast-growing Internet produces massive amounts of text data. Due to the large volume and unstructured format of this data, extracting relevant information and analyzing it becomes very challenging. Text document clustering is a text-mining process that partitions a set of text documents into mutually exclusive clusters such that documents within the same group are similar to each other, while documents from different clusters differ in content. One of the biggest challenges in text clustering is partitioning the collection by measuring the relevance of the content in the documents. To address this issue, a hybrid of a swarm intelligence algorithm and the K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on the text documents, indicates that the proposed approach is robust and superior to other state-of-the-art methods.
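The hybridisation pattern can be sketched as follows: a fruit-fly-style random search proposes initial centroids and K-means refines them; the smell and vision phases are reduced to Gaussian perturbation plus greedy selection, and synthetic blobs stand in for text vectors, so this is an illustration of the pattern rather than the authors' algorithm.

```python
# Sketch of the common hybridisation pattern: a fruit-fly-style random search
# proposes initial centroids, and K-means refines them. The smell/vision phases
# are reduced to random perturbation plus greedy selection for brevity.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
K, FLIES, ITERS = 4, 15, 30
rng = np.random.default_rng(0)

def sse(centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

best = X[rng.choice(len(X), K, replace=False)]
best_f = sse(best)
for _ in range(ITERS):
    swarm = best + rng.normal(scale=0.5, size=(FLIES, K, X.shape[1]))  # smell phase
    fits = np.array([sse(c) for c in swarm])
    if fits.min() < best_f:                                            # vision phase
        best, best_f = swarm[fits.argmin()], fits.min()

km = KMeans(n_clusters=K, init=best, n_init=1, random_state=0).fit(X)  # refinement
print("SSE before refinement:", round(best_f, 1), " after:", round(km.inertia_, 1))
```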


Author(s):  
Shaymaa H. Mohammed ◽  
Salam Al-augby

With the rapid growth of information technology, the amount of unstructured text data in digital libraries has increased rapidly, and analyzing, organizing, and automatically classifying the text in e-research repositories so that it can be exploited has become a major challenge. Manual categorization of text documents requires substantial financial and human resources, so topic modeling is used instead to classify documents. This paper presents a comparative study of scientific unstructured text document classification (e-books) based on full text, applying the most popular topic modeling approaches (LDA and LSA) to cluster words into sets of topics that serve as important keywords for classification. The dataset consists of 300 books containing about 23 million words of full text. In the topic models used (LSA, LDA), each word in the corpus vocabulary is connected with one or more topics with a probability estimated by the model. Many LDA and LSA models were built with different coherence values, and the one producing the highest coherence was selected. The results show that LDA outperforms LSA: the best LDA model reached a coherence value of 0.592179 with 20 topics, while the best LSA coherence value was 0.5773026 with 10 topics.
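A hedged sketch of the comparison workflow using gensim is shown below: LDA and LSA models are built over the same corpus and their c_v coherence values are compared; the four-document toy corpus and the topic counts are placeholders for the 300-book dataset.

```python
# Hedged sketch: build LDA and LSA models over the same toy corpus with gensim
# and compare their c_v coherence scores; the corpus and topic counts are
# placeholders, not the paper's 300-book dataset or tuned settings.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel, CoherenceModel

texts = [["galaxy", "star", "orbit", "telescope"],
         ["neuron", "brain", "memory", "synapse"],
         ["star", "planet", "orbit", "gravity"],
         ["brain", "cortex", "neuron", "learning"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

models = [("LDA", LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)),
          ("LSA", LsiModel(corpus, id2word=dictionary, num_topics=2))]
for name, model in models:
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence="c_v")
    print(name, "coherence:", round(cm.get_coherence(), 3))
```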

