An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop

Author(s):  
Lokesh Sahu ◽  
Biju R. Mohan
Author(s):  
Youssef Elfakir ◽  
Ghizlane Khaissidi ◽  
Mostafa Mrabti ◽  
Driss Chenouni ◽  
Manal Boualam

Similarity and distance measures are widely used to quantify the similarity or dissimilarity between vector sequences. In document image analysis, which deals with image information, both kinds of measure play an important role in matching and pattern recognition. Several types of similarity measure exist; in this paper we survey the distance measures used in image matching and explain the limitations of the existing distances. We then introduce the concept of the floating distance, which describes how the selected threshold varies for each word in the decision-making process, based on a combination of linear regression and cosine distance. Experiments are carried out on handwritten Arabic document images from the Gallica library. They show that the proposed floating distance outperforms the traditional distance in a word spotting system.


Symmetry ◽  
2018 ◽  
Vol 10 (11) ◽  
pp. 602 ◽  
Author(s):  
Donghai Liu ◽  
Xiaohong Chen ◽  
Dan Peng

This paper proposes a neutrosophic hesitant fuzzy linguistic term set (NHFLTS) based on the hesitant fuzzy linguistic term set (HFLTS) and the neutrosophic set (NS), which can flexibly express inconsistent and uncertain information in multiple criteria decision making problems. The basic operational laws of NHFLTS based on the linguistic scale function are also discussed. We then propose the generalized neutrosophic hesitant fuzzy linguistic distance measure and discuss its properties. Furthermore, a new similarity measure of NHFLTS is given that combines the generalized neutrosophic hesitant fuzzy linguistic distance measure with the cosine function. A corresponding cosine distance measure between NHFLTSs is proposed according to the relationship between the similarity measure and the distance measure, and we extend the technique for order preference by similarity to an ideal solution (TOPSIS) method to the obtained cosine distance measure. The main advantages are that the proposed NHFLTS is defined on the linguistic scale function, so decision makers can flexibly convert linguistic information to semantic values, and that the proposed cosine distance measure between NHFLTSs with the TOPSIS method can deal with the related decision information not only from the point of view of algebra but also from the point of view of geometry. Finally, the reasonableness and effectiveness of the proposed method are demonstrated by an illustrative example, which is also compared with other existing methods.


Author(s):  
U. K. Sridevi ◽  
P. Shanthi ◽  
N. Nagaveni

Searching for relevant documents on the web has become more challenging due to the rapid growth in information. Although an enormous amount of information is available online, most documents are uncategorized. It is time-consuming for users to browse through a large number of documents and search for information about specific topics. Automatic clustering of these documents has great potential to improve the efficiency of information seeking. To address this issue, the authors propose a deep ontology-based approach to document clustering, whose implementation uses annotation rules. The work compares the information extraction capabilities of the annotation framework with and without the ontology. The F-measure increases when the ontology is used as the distance measure, and the ontology achieves an improvement of 11% over keyword search. The obtained results are encouraging.


2019 ◽  
Vol 8 (2) ◽  
pp. 2938-2942

With the huge growth of internet usage, the volume of information flow has also increased, which leads to the problem of information congestion. In unsupervised learning, clustering is considered one of the most important problems. Large quantity, high dimensionality, and complicated semantics are the difficult issues in document clustering, which focuses on identifying structure in an unlabeled data collection. Clustering is a method in which data items are grouped according to their resemblance to one another, separating them from dissimilar objects. As for what makes a good clustering, it can be shown that there is no absolute “best” criterion independent of the final objective of the clustering. The primary objective of a good document clustering scheme is to minimize the intra-cluster distance between documents while maximizing the inter-cluster distance (using a suitable document distance measure). A distance measure (or, dually, a measure of resemblance) is therefore at the core of document clustering. This review gives an overview of the different methods (Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, Singular Value Decomposition, Doc2Vec Model, Graph Model), distance measures (Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient), and evaluation parameters of document clustering. This work is theoretical in nature and aims to cover the overall procedure of document clustering.
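As a rough illustration of the four distance and similarity measures named above, the following Python sketch computes them for small term-frequency vectors. The function names and toy vectors are hypothetical and chosen only for illustration. The Jaccard coefficient is taken here over the sets of terms with nonzero weight, which is one common convention and an assumption on our part, not necessarily the variant discussed in the review.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Set-based Jaccard over the indices of terms with nonzero weight (assumed convention)."""
    terms_a = {i for i, x in enumerate(a) if x != 0}
    terms_b = {i for i, y in enumerate(b) if y != 0}
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def pearson_correlation(a, b):
    """Pearson correlation of the two weight vectors; 0.0 if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Toy term-frequency vectors over a four-term vocabulary (hypothetical data).
doc_a = [1, 2, 0, 1]
doc_b = [2, 4, 0, 2]   # same term proportions as doc_a, twice as long
doc_c = [0, 0, 3, 0]   # shares no terms with doc_a

for name, other in (("doc_b", doc_b), ("doc_c", doc_c)):
    print(name,
          euclidean_distance(doc_a, other),
          cosine_similarity(doc_a, other),
          jaccard_coefficient(doc_a, other),
          pearson_correlation(doc_a, other))
```

Note that doc_a and doc_b are far apart in Euclidean terms but have a cosine similarity of essentially 1, which is one reason length-insensitive measures are often preferred for documents of differing size.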


2012 ◽  
Vol 57 (3) ◽  
pp. 829-835 ◽  
Author(s):  
Z. Głowacz ◽  
J. Kozik

The paper describes a procedure for the automatic selection of symptoms accompanying a break in the armature winding coils of a synchronous motor. This procedure, called feature selection, chooses from the full set of features describing the problem a subset that best distinguishes between healthy and damaged states. The amplitudes of the spectral components of the motor current signals were used as features. The full spectra of the current signals are treated as multidimensional feature spaces, and their subspaces are tested. Particular subspaces are chosen with the aid of a genetic algorithm, and their goodness is tested using the Mahalanobis distance measure. The algorithm searches for the subspaces for which this distance is greatest. The algorithm is very efficient and, as confirmed by research, leads to good results. The proposed technique is successfully applied in many other fields of science and technology, including medical diagnostics.
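The idea of scoring a feature subspace by the Mahalanobis distance between the healthy and damaged classes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pooled within-class covariance matrix, uses made-up toy data, and replaces the genetic algorithm with an exhaustive search over small subsets. All function and variable names are hypothetical.

```python
import math
from itertools import combinations


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the class mean."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += dev[i] * dev[j]
    return s


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_between_classes(healthy, damaged, subset):
    """Mahalanobis distance between class means in the chosen feature subspace,
    using a pooled within-class covariance (an assumption of this sketch)."""
    h = [[row[i] for i in subset] for row in healthy]
    d = [[row[i] for i in subset] for row in damaged]
    mean_h, mean_d = mean_vector(h), mean_vector(d)
    s_h, s_d = scatter_matrix(h, mean_h), scatter_matrix(d, mean_d)
    dof = len(h) + len(d) - 2
    k = len(subset)
    pooled = [[(s_h[i][j] + s_d[i][j]) / dof for j in range(k)] for i in range(k)]
    diff = [a - b for a, b in zip(mean_h, mean_d)]
    weighted = solve(pooled, diff)
    return math.sqrt(max(0.0, sum(a * b for a, b in zip(diff, weighted))))


def best_subset(healthy, damaged, n_features, size):
    """Exhaustive stand-in for the genetic search: pick the subset with the largest distance."""
    return max(combinations(range(n_features), size),
               key=lambda s: mahalanobis_between_classes(healthy, damaged, s))


# Hypothetical spectral amplitudes: feature 0 separates the classes, features 1 and 2 do not.
HEALTHY = [[1.0, 0.2, 5.0], [1.2, 0.1, 4.0], [0.8, 0.3, 6.0], [1.0, 0.2, 5.0]]
DAMAGED = [[3.0, 0.2, 5.5], [3.2, 0.3, 4.5], [2.8, 0.1, 5.0], [3.0, 0.2, 5.0]]

print(best_subset(HEALTHY, DAMAGED, n_features=3, size=1))
```

Exhaustive search is only feasible for a handful of features; for full current spectra the number of subsets grows combinatorially, which is presumably why a heuristic search such as a genetic algorithm is used in the paper.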


Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: With the increasing volume of text documents on Internet pages, dealing with such a tremendous amount of information becomes complex because of its sheer size. Text clustering is a common optimization problem in which a large amount of text information is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, to solve the text document clustering problem by modeling it so that similar documents are partitioned into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It has been introduced to strike a balance between local and global search. Local search methods, such as the k-medoid and k-means techniques, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves the performance of text clustering.
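A hill climber with a β operator for document clustering might look roughly like the sketch below. The abstract does not give the operators or the objective, so everything here is an assumption made for illustration: a solution is a list of cluster labels, the local move reassigns one document, the β operator re-draws each label with probability `beta`, and the fitness is the average cosine similarity of documents to their cluster centroid. It is not the authors' implementation.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fitness(docs, labels, k):
    """Average cosine similarity of each document to its cluster centroid (assumed objective)."""
    total = 0.0
    for cluster in range(k):
        members = [d for d, lab in zip(docs, labels) if lab == cluster]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(cosine_similarity(d, centroid) for d in members)
    return total / len(docs)


def beta_hill_climbing(docs, k, beta=0.05, iterations=500, seed=0):
    """Hill climbing with a beta operator; setting beta=0 gives plain hill climbing."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in docs]
    best = fitness(docs, labels, k)
    for _ in range(iterations):
        candidate = labels[:]
        # Local move (exploitation): reassign one randomly chosen document.
        candidate[rng.randrange(len(docs))] = rng.randrange(k)
        # Beta operator (exploration): re-draw each label with probability beta.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        score = fitness(docs, candidate, k)
        if score > best:  # keep the candidate only if it improves the fitness
            labels, best = candidate, score
    return labels, best


# Hypothetical two-term document vectors forming two loose groups.
DOCS = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]

labels, score = beta_hill_climbing(DOCS, k=2)
print(labels, round(score, 4))
```

Because candidates are accepted only when they improve the fitness, the returned score is never worse than that of the random starting assignment; the β operator merely widens the set of candidates that can be reached in a single step.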


2019 ◽  
Vol 5 (6) ◽  
pp. 57 ◽  
Author(s):  
Gang Wang ◽  
Bernard De Baets

Superpixel segmentation can benefit from the use of an appropriate method to measure edge strength. In this paper, we present such a method based on the first derivative of anisotropic Gaussian kernels. The kernels can capture the position, direction, prominence, and scale of the edge to be detected. We incorporate the anisotropic edge strength into the distance measure between neighboring superpixels, thereby improving the performance of an existing graph-based superpixel segmentation method. Experimental results validate the superiority of our method in generating superpixels over the competing methods. It is also illustrated that the proposed superpixel segmentation method can facilitate subsequent saliency detection.

