Distance Measures and Stemming Impact on Arabic Document Clustering

Author(s):  
Qusay Bsoul ◽  
Eiman Al-Shamari ◽  
Masnizah Mohd ◽  
Jaffar Atwan
Author(s):  
Mari-Sanna Paukkeri ◽  
Ilkka Kivimäki ◽  
Santosh Tirunagari ◽  
Erkki Oja ◽  
Timo Honkela

2019 ◽  
Vol 8 (2) ◽  
pp. 2938-2942

Due to the huge growth of internet usage, the volume of information flow has also increased, which leads to the problem of information congestion. In unsupervised learning, clustering is considered one of the most important problems. Large volume, high dimensionality, and complicated semantics are the main difficulties of document clustering. It focuses on identifying structure in an unlabeled data collection. Clustering is a method in which data items are identified and grouped based on their resemblance to one another, separating them from dissimilar objects. As for deciding what makes a good clustering, it can be shown that there is no absolute “best” criterion independent of the final objective of the clustering. The primary objective of a good document clustering scheme is to minimize the intra-cluster distance between documents while maximizing the inter-cluster distance (using a suitable document distance measure). A distance measure (or, dually, a measure of resemblance) is therefore at the core of document clustering. This review surveys the different methods (Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, Singular Value Decomposition, Doc2Vec Model, Graph model), distance measures (Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient), and evaluation parameters of document clustering. This work is theoretical in nature and aims to cover the overall procedure of document clustering.
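The four distance and similarity measures named in this abstract can be illustrated with a minimal sketch. It assumes documents are already represented as equal-length numeric vectors (for example term-frequency vectors); the function names are hypothetical, and the Jaccard coefficient is shown in its extended (Tanimoto) form for weighted vectors, which is an assumption since the abstract does not say which variant is meant.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Extended Jaccard (Tanimoto): dot / (|a|^2 + |b|^2 - dot).

    Assumption: the weighted-vector form is used rather than the set form.
    Two all-zero vectors are treated as identical (1.0).
    """
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0


def pearson_correlation(a, b):
    """Pearson correlation of two vectors; 0.0 if either has zero variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Hypothetical term-frequency vectors for two short documents.
    doc1 = [2.0, 0.0, 1.0, 3.0]
    doc2 = [1.0, 1.0, 0.0, 2.0]
    for measure in (euclidean_distance, cosine_similarity,
                    jaccard_coefficient, pearson_correlation):
        print(measure.__name__, measure(doc1, doc2))
```

Note that Euclidean distance is a dissimilarity (smaller means closer), while the other three are similarities (larger means closer), so a clustering routine would typically convert them, for instance with `1 - similarity`, before comparing.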


2016 ◽  
Vol 5 (1) ◽  
pp. 61-80
Author(s):  
SAMEERA G ◽  
VISHNU VARDHAN R

Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Given the increasing volume of text documents on Internet pages, handling such a large amount of information becomes highly complex. Text clustering is a common optimization problem in which a large collection of text documents is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, to solve the text document clustering problem by modeling the β-hill climbing technique to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It was introduced to strike a balance between local and global search. Local search methods, such as the k-medoid and k-means techniques, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
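The idea of adding a β operator to hill climbing can be sketched as follows. This is not the authors' implementation: the abstract gives no algorithmic details, so the neighbourhood move, the random-reset form of the β operator, the sum-of-squared-errors objective, and all names (`sse`, `beta_hill_climbing`) are assumptions made for illustration.

```python
import random


def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid.

    Assumed objective; the abstract does not state which one is used.
    """
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for j, v in enumerate(p):
            sums[c][j] += v
    # Empty clusters have no centroid and contribute nothing to the cost.
    centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    return sum(
        sum((v - m) ** 2 for v, m in zip(p, centroids[c]))
        for p, c in zip(points, labels)
    )


def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    """Hill climbing over cluster labels with an extra beta (random reset) step."""
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in points]
    current_cost = sse(points, current, k)
    for _ in range(iterations):
        candidate = current[:]
        # Local move (assumed): reassign one randomly chosen document.
        candidate[rng.randrange(len(points))] = rng.randrange(k)
        # Beta step (assumed): with probability beta, reset each label at
        # random, which adds a global-search component to the local search.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        cost = sse(points, candidate, k)
        # Keep the candidate only if it is not worse than the current solution.
        if cost <= current_cost:
            current, current_cost = candidate, cost
    return current, current_cost


if __name__ == "__main__":
    # Hypothetical 2-D "document" vectors forming two loose groups.
    docs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    labels, cost = beta_hill_climbing(docs, k=2)
    print(labels, cost)
```

With `beta=0` the sketch reduces to plain hill climbing, so the parameter directly controls how much random exploration is mixed into the local search.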


2017 ◽  
Vol 2017 ◽  
pp. 1-13
Author(s):  
Jimena Olveres ◽  
Erik Carbajal-Degante ◽  
Boris Escalante-Ramírez ◽  
Enrique Vallejo ◽  
Carla María García-Moreno

Segmentation tasks in medical imaging represent a demanding challenge for scientists, since the nature of image acquisition yields issues that hamper correct reconstruction and visualization. Depending on the specific image modality, we have to consider limitations such as the presence of noise, vanished edges, or high intensity differences, known in most cases as inhomogeneities. New segmentation algorithms are required to provide better performance. This paper presents a new unified approach to improve traditional segmentation methods such as Active Shape Models and the Chan-Vese model based on level sets. The approach combines local analysis implementations with classic segmentation algorithms, incorporating local texture information given by the Hermite transform and Local Binary Patterns. The mixture of region-based methods and local descriptors highlights relevant regions by considering extra information that helps to delimit structures. We performed segmentation experiments on 2D images, including the midbrain in Magnetic Resonance Imaging and the endocardium of the heart’s left ventricle in Computed Tomography. Quantitative evaluation was obtained with the Dice coefficient and Hausdorff distance measures. Results display a substantial advantage over the original methods when we include our characterization schemes. We propose further research validation on different organ structures with promising results.
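The two evaluation measures mentioned here, the Dice coefficient and the Hausdorff distance, can be illustrated with a small sketch. It assumes each segmentation is given as a non-empty set of (row, column) pixel coordinates; the function names are hypothetical and the brute-force Hausdorff computation is only suitable for small inputs.

```python
import math


def dice_coefficient(mask_a, mask_b):
    """Overlap of two pixel sets: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(points_a, points_b), directed(points_b, points_a))


if __name__ == "__main__":
    # Hypothetical segmentation result and reference, as pixel coordinates.
    predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
    reference = {(0, 1), (1, 1), (0, 2), (1, 2)}
    print(dice_coefficient(predicted, reference))
    print(hausdorff_distance(predicted, reference))
```

The two measures are complementary: Dice rewards region overlap, while the Hausdorff distance penalizes the single worst boundary deviation.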


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 436
Author(s):  
Ruirui Zhao ◽  
Minxia Luo ◽  
Shenggang Li

Picture fuzzy sets, which are the extension of intuitionistic fuzzy sets, can deal with inconsistent information better in practical applications. A distance measure is an important mathematical tool to calculate the difference degree between picture fuzzy sets. Although some distance measures of picture fuzzy sets have been constructed, there are some unreasonable and counterintuitive cases. The main reason is that the existing distance measures do not or seldom consider the refusal degree of picture fuzzy sets. In order to solve these unreasonable and counterintuitive cases, in this paper, we propose a dynamic distance measure of picture fuzzy sets based on a picture fuzzy point operator. Through a numerical comparison and multi-criteria decision-making problems, we show that the proposed distance measure is reasonable and effective.


Author(s):  
Paraskevi Massara ◽  
Charles D G Keown-Stoneman ◽  
Lauren Erdman ◽  
Eric O Ohuma ◽  
Celine Bourdon ◽  
...  

Background: Most studies on children evaluate longitudinal growth as an important health indicator. Different methods have been used to detect growth patterns across childhood, but with no comparison between them to evaluate result consistency. We explored the variation in growth patterns as detected by different clustering and latent class modelling techniques. Moreover, we investigated how the characteristics/features (e.g. slope, tempo, velocity) of longitudinal growth influence pattern detection. Methods: We studied 1134 children from The Applied Research Group for Kids cohort with longitudinal-growth measurements [height, weight, body mass index (BMI)] available from birth until 12 years of age. Growth patterns were identified by latent class mixed models (LCMM) and time-series clustering (TSC) using various algorithms and distance measures. Time-invariant features were extracted from all growth measures. A random forest classifier was used to predict the identified growth patterns for each growth measure using the extracted features. Results: Overall, 72 TSC configurations were tested. For BMI, we identified three growth patterns by both TSC and LCMM. The clustering agreement was 58% between LCMM and TS clusters, whereas it varied between 30.8% and 93.3% within the TSC configurations. The extracted features (n = 67) predicted the identified patterns for each growth measure with accuracy of 82%–89%. Specific feature categories were identified as the most important predictors for patterns of all tested growth measures. Conclusion: Growth-pattern detection is affected by the method employed. This can impact on comparisons across different populations or associations between growth patterns and health outcomes. Growth features can be reliably used as predictors of growth patterns.


Molecules ◽  
2021 ◽  
Vol 26 (6) ◽  
pp. 1546
Author(s):  
Ioanna Dagla ◽  
Anthony Tsarbopoulos ◽  
Evagelos Gikas

Colistimethate sodium (CMS) is widely administered for the treatment of life-threatening infections caused by multidrug-resistant Gram-negative bacteria. Until now, the quality control of CMS formulations has been based on microbiological assays. Herein, an ultra-high-performance liquid chromatography method coupled to an ultraviolet detector was developed for the quantitation of CMS in injectable formulations. A design of experiments was performed to optimize the chromatographic parameters. The chromatographic separation was achieved using a Waters Acquity BEH C8 column employing gradient elution with a mobile phase consisting of (A) 0.001 M aq. ammonium formate and (B) methanol/acetonitrile 79/21 (v/v). CMS compounds were detected at 214 nm. In all, 23 univariate linear-regression models were constructed to measure CMS compounds separately, and one partial least-squares regression (PLSr) model was constructed to assess the total CMS amount in formulations. The method was validated over the range 100–220 μg mL−1. The developed methodology was employed to analyze several batches of CMS injectable formulations, which were also compared against a reference batch using Principal Component Analysis, similarity and distance measures, heatmaps, and the structural similarity index. The methodology was based on freely available software in order to be readily available for the pharmaceutical industry.


Author(s):  
Ruina Bai ◽  
Ruizhang Huang ◽  
Yanping Chen ◽  
Yongbin Qin
