An alternative clustering approach for reconstructing cross cut shredded text documents

2011 ◽  
Vol 52 (3) ◽  
pp. 1491-1501 ◽  
Author(s):  
Azzam Sleit ◽  
Yacoub Massad ◽  
Mohammed Musaddaq
Optimization ◽  
2013 ◽  
Vol 62 (2) ◽  
pp. 227-240 ◽  
Author(s):  
Alireza Amirteimoori ◽  
Sohrab Kordrostami

2018 ◽  
Vol 7 (3) ◽  
pp. 213-224
Author(s):  
Rafał Woźniak ◽  
Piotr Ożdżyński ◽  
Danuta Zakrzewska

The development of the Internet has resulted in an increasing number of online text repositories. In many cases, documents are assigned to more than one class, and automatic multi-label classification needs to be used. When the number of labels exceeds the number of documents, effective label space dimension reduction may significantly improve classification accuracy, which is a major priority in the medical field. In this paper, we propose document clustering for label selection. We use a semi-clustering method based on a graph representation, in which documents are represented by vertices and edge weights are calculated according to their mutual similarity. Assigning documents to semi-clusters helps reduce the number of labels subsequently used in the multi-label classification process. The performance of the method is examined in experiments conducted on real medical datasets.
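The graph representation described in this abstract can be illustrated with a small sketch. It assumes whitespace tokenization and cosine similarity over raw term counts as the edge weight; the abstract does not state which similarity measure the authors use, and the names `cosine` and `similarity_graph` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(docs, threshold=0.0):
    """Build a weighted graph: vertices are document indices,
    edges carry the pairwise similarity when it exceeds the threshold.
    The similarity measure is an assumption, not taken from the paper."""
    vectors = [Counter(d.lower().split()) for d in docs]
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        weight = cosine(vectors[i], vectors[j])
        if weight > threshold:
            edges[(i, j)] = weight
    return edges


docs = ["heart disease risk", "heart disease treatment", "bone fracture"]
edges = similarity_graph(docs)
```

A semi-clustering step would then group vertices that are connected by heavy edges; that step is not sketched here because the abstract gives no detail about it.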


2017 ◽  
Vol 2017 ◽  
pp. 1-13 ◽  
Author(s):  
Junkai Yi ◽  
Yacong Zhang ◽  
Xianghui Zhao ◽  
Jing Wan

Text clustering is an effective approach to collecting and organizing text documents into meaningful groups for mining valuable information on the Internet. However, issues such as feature extraction and data dimension reduction remain to be tackled. To overcome these problems, we present a novel approach named deep-learning vocabulary network. The vocabulary network is constructed from a related-word set, which contains the “co-occurrence” relations of words or terms. We replace term frequency in feature vectors with the “importance” of words in terms of the vocabulary network and PageRank, which can generate more precise feature vectors to represent the meaning of the text. Furthermore, a sparse-group deep belief network is proposed to reduce the dimensionality of the feature vectors, and we introduce coverage rate as the similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms, and experimental results show that feature vectors based on the deep-learning vocabulary network give better clustering performance.
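The Single-Pass clustering mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming Jaccard overlap of term sets as a stand-in for the paper's coverage rate, whose definition the abstract does not give; the function names and the threshold value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two term sets (stand-in similarity, an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass(docs, threshold=0.3):
    """Single-Pass clustering: each document is seen once and either joins
    the most similar existing cluster or starts a new one."""
    clusters = []  # each cluster: {"terms": set of terms, "members": [doc indices]}
    for idx, text in enumerate(docs):
        terms = set(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            best["terms"] |= terms  # cluster profile grows with each member
        else:
            clusters.append({"terms": set(terms), "members": [idx]})
    return [c["members"] for c in clusters]


docs = [
    "deep belief network",
    "deep belief model",
    "page rank graph",
    "rank graph walk",
]
groups = single_pass(docs)
```

Because documents are processed in order and never reassigned, the result depends on input order and on the threshold, which is the usual trade-off of a single-pass scheme.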


2017 ◽  
Vol 79 (5) ◽  
Author(s):  
Athraa Jasim Mohammed ◽  
Yuhanis Yusof ◽  
Husniza Husni

Text clustering is one of the text mining tasks employed in search engines. Discovering the optimal number of clusters for a dataset or repository is a challenging problem. Various clustering algorithms have been reported in the literature, but most of them rely on a pre-defined number of clusters, k. In this study, a variant of the Firefly algorithm, termed FireflyClust, is proposed to automatically cluster text documents in a hierarchical manner. The proposed clustering method operates in five phases: data pre-processing, clustering, item re-location, cluster selection and cluster refinement. Experiments are undertaken with different selections of threshold value. Results on the TREC collections TR11, TR12, TR23 and TR45 showed that FireflyClust is a better approach than Bisect K-means, hybrid Bisect K-means and the Practical General Stochastic Clustering Method. Such a result may guide the development of better information retrieval engines for this dynamic and fast-growing big data era.


2019 ◽  
Vol 8 (3) ◽  
pp. 6634-6643 ◽  

Opinion mining and sentiment analysis are valuable for extracting useful subjective information from text documents. Predicting customers' opinions on Amazon products has several benefits, such as reducing customer churn, agent monitoring, handling multiple customers, tracking overall customer satisfaction, quick escalations, and upselling opportunities. However, finding users' sentiments in large datasets is a challenging task for researchers because of the data's unstructured nature, slang, misspellings and abbreviations. To address this problem, a new system is developed in this research study. The proposed system comprises four major phases: data collection, pre-processing, keyword extraction, and classification. Initially, the input data were collected from the Amazon customer review dataset. After collecting the data, pre-processing was carried out to enhance the quality of the collected data. The pre-processing phase comprises three steps: lemmatization, review spam detection, and removal of stop-words and URLs. Then, a topic modelling approach, Latent Dirichlet Allocation (LDA), along with modified Possibilistic Fuzzy C-Means (PFCM), was applied to extract the keywords and to help identify the relevant topics. The extracted keywords were classified into three forms (positive, negative and neutral) by applying a machine learning classifier: Convolutional Neural Network (CNN). The experimental outcome showed that the proposed system improved the accuracy of sentiment analysis by 6-20% compared with the existing systems.
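One part of the pre-processing phase described above, the removal of stop-words and URLs, can be sketched with the standard library alone. The stop-word list, the URL pattern and the function name `preprocess` are assumptions made for illustration; lemmatization and review spam detection are left out because the abstract does not say how they are done.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "to", "of"}

# Simple URL pattern covering http(s) links and bare www. addresses.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def preprocess(review):
    """Lowercase a review, drop URLs, tokenize, and remove stop-words."""
    text = URL_RE.sub(" ", review.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("This is a great phone, see https://example.com/x and buy it!")
```

The resulting token list would be the input to the later keyword extraction and classification phases.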


Author(s):  
Hussain A. Jaber ◽  
Ilyas Çankaya ◽  
Hadeel K. Aljobouri ◽  
Orhan M. Koçak ◽  
Oktay Algin

Background: Cluster analysis is a robust tool for exploring the underlying structures in data and grouping similar objects. In Functional Magnetic Resonance Imaging (fMRI) research, clustering approaches attempt to classify voxels, according to their time-course signals, into groups with a similar hemodynamic response over time. Objective: In this work, a novel unsupervised learning approach is proposed that applies the Enhanced Neural Gas (ENG) algorithm to fMRI data, for comparison with the Neural Gas (NG) method, which has yet to be utilized for that aim. The ENG algorithm depends on the network structure of the NG and concentrates on an efficacious prototype-based clustering approach. Methods: The comparison outcomes on real auditory fMRI data show that ENG outperforms the NG and statistical parametric mapping (SPM) methods owing to its insensitivity to the ordering of the input data sequence, to different initializations for selecting a set of neurons, and to the existence of extreme values (outliers). The findings also demonstrate its capability to discover the true number of clusters effectively. Results: Four validation indices are applied to evaluate the performance of the proposed ENG method with fMRI and to compare it with a clustering approach (NG algorithm) and model-based data analysis (SPM). These validation indices include the Jaccard Coefficient (JC), Receiver Operating Characteristic (ROC), Minimum Description Length (MDL) value, and Minimum Square Error (MSE). Conclusion: The ENG technique can tackle all shortcomings of NG application with fMRI data, identify the active area of the human brain effectively, and determine the locations of the cluster centers based on the MDL value during the process of network learning.
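For readers unfamiliar with the baseline, the rank-based update at the core of the standard Neural Gas (NG) method can be sketched as below. This illustrates plain NG only, not the ENG variant, whose modifications the abstract does not describe; the step size `eps`, the neighborhood range `lam` and the function name are illustrative assumptions.

```python
import math


def neural_gas_step(prototypes, x, eps=0.5, lam=1.0):
    """One Neural Gas adaptation step for a single input vector x.

    Prototypes are ranked by distance to x (rank 0 is the closest), and each
    moves toward x by eps * exp(-rank / lam) times its offset from x.
    """
    order = sorted(
        range(len(prototypes)),
        key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i], x)),
    )
    updated = [list(p) for p in prototypes]
    for rank, i in enumerate(order):
        h = math.exp(-rank / lam)  # neighborhood weight decays with rank
        updated[i] = [p + eps * h * (v - p) for p, v in zip(prototypes[i], x)]
    return updated


prototypes = [[0.0, 0.0], [4.0, 0.0]]
moved = neural_gas_step(prototypes, [1.0, 0.0])
```

In a full training loop, `eps` and `lam` are typically decayed over time and the step is repeated over many input vectors, here voxel time courses.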

