Clustering and Classification
Recently Published Documents


TOTAL DOCUMENTS: 547 (FIVE YEARS: 198)

H-INDEX: 29 (FIVE YEARS: 7)

Mathematics, 2022, Vol. 10 (1), pp. 128
Author(s): Güvenç Arslan, Uğur Madran, Duygu Soyoğlu

In this note, we propose a novel classification approach by introducing a new clustering method, which is used as an intermediate step to discover the structure of a data set. The proposed clustering algorithm uses similarities and the concept of a clique to obtain clusters, which can then be used with different strategies for classification. This approach also reduces the size of the training data set. In this study, we apply support vector machines (SVMs) after obtaining clusters with the proposed clustering algorithm, testing several strategies for combining the two steps. The results for several real data sets show that the performance is comparable with that of the standard SVM while reducing both the size of the training data set and the number of support vectors.
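The general shape of this pipeline, decoupled from the paper's clique-based clustering, can be sketched as follows: cluster the training data per class, keep only cluster representatives, and train the SVM on the reduced set. In the sketch below the clique-based step is replaced by plain k-means, so this is only an illustration of the reduction idea, not the proposed algorithm.

```python
# Minimal sketch: shrink the SVM training set by clustering first.
# k-means is a stand-in for the paper's clique-based clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Cluster each class separately and keep the cluster centers as the reduced training set.
reduced_X, reduced_y = [], []
for label in np.unique(y_tr):
    Xc = X_tr[y_tr == label]
    km = KMeans(n_clusters=min(5, len(Xc)), n_init=10, random_state=0).fit(Xc)
    reduced_X.append(km.cluster_centers_)
    reduced_y.extend([label] * km.n_clusters)

svm = SVC(kernel="rbf").fit(np.vstack(reduced_X), np.array(reduced_y))
print("accuracy on held-out data:", svm.score(X_te, y_te))
```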


2022, pp. 1843-1863
Author(s): Viju Raghupathi, Yilu Zhou, Wullianallur Raghupathi

In this article, the authors explore the potential of a big data analytics approach to unstructured text analytics of cancer blogs. The application is developed on the Cloudera platform's Hadoop MapReduce framework. It uses several text analytics algorithms, including word count, word association, clustering, and classification, to identify and analyze the patterns and keywords in cancer blog postings. This article establishes an exploratory approach to applying big data analytics methods in the development of text analytics applications for the analysis of cancer blogs. Additional insights are extracted through various means, including the development of categories or keywords contained in the blogs, the development of a taxonomy, and the examination of relationships among the categories. The application has the potential for generalizability and implementation with health content in other blogs and social media. It can provide insight and decision support for cancer management and facilitate efficient and relevant searches for information related to cancer.
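As a rough illustration of the map/reduce pattern underlying the word-count step (the simplest of the listed algorithms), the following plain-Python sketch mimics what the Hadoop job does; the blog posts are invented placeholders, and Cloudera/Hadoop specifics are not shown.

```python
# Toy map/reduce-style word count over blog posts (plain-Python stand-in for
# the Hadoop MapReduce job described in the article; the posts are made up).
from collections import Counter
from itertools import chain
import re

posts = [
    "Chemotherapy side effects and coping strategies",
    "Coping with fatigue during chemotherapy",
]

def mapper(post):
    # Emit (word, 1) pairs, lower-cased and stripped of punctuation.
    for word in re.findall(r"[a-z]+", post.lower()):
        yield word, 1

def reducer(pairs):
    # Sum the counts per word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

print(reducer(chain.from_iterable(mapper(p) for p in posts)).most_common(5))
```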


2021, Vol. 2021, pp. 1-8
Author(s): Qiujuan Yang

As the most basic element in English learning, vocabulary has always been the focus of teaching in college English classes, but the teaching results are often unsatisfactory. In this paper, the fitness-function design of a genetic algorithm is integrated with the K-medoids algorithm to form K-GA-medoids, which is then combined with KNN to form an algorithmic framework for English vocabulary classification. In the classification process, clustering and classification are carried out in turn, reducing the training set and thus the computational overhead. The experiments show that K-GA-medoids achieves a significantly better clustering effect than traditional K-medoids, and that combining K-GA-medoids with KNN improves the efficiency of English vocabulary classification over the traditional KNN algorithm while maintaining classification accuracy. We also found that students in college English courses consider word memorization a difficult learning task, that traditional vocabulary teaching methods are not very effective, and that etymology is little known to students and rarely covered in classroom lectures. The article therefore explores new ideas and strategies for teaching vocabulary in college English from the perspective of etymology.
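A simplified sketch of the reduce-then-classify pipeline follows: plain K-medoids (without the genetic-algorithm fitness step of K-GA-medoids) compresses each class to a few medoids, and a KNN classifier is then fit on the medoids only. The dataset and cluster counts are illustrative choices, not those of the paper.

```python
# Simplified pipeline: K-medoids reduces each class to a few medoids, then KNN
# is trained on the medoids only. The GA fitness step of K-GA-medoids is omitted.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def k_medoids(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)              # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) == 0:
                continue
            # new medoid = member minimizing total distance to the other members
            new_medoids[j] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

med_X, med_y = [], []
for label in np.unique(y_tr):
    Xc = X_tr[y_tr == label]
    idx = k_medoids(Xc, k=min(10, len(Xc)))
    med_X.append(Xc[idx])
    med_y.extend([label] * len(idx))

knn = KNeighborsClassifier(n_neighbors=3).fit(np.vstack(med_X), np.array(med_y))
print("accuracy with reduced training set:", knn.score(X_te, y_te))
```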


2021, Vol. 2021, pp. 1-14
Author(s): Wenyun Gao, Xiaoyun Li, Sheng Dai, Xinghui Yin, Stanley Ebhohimhen Abhadiomhen

The low-rank representation (LRR) method has recently gained enormous popularity due to its robust approach to solving the subspace segmentation problem, particularly for corrupted data. In this paper, the recursive sample scaling low-rank representation (RSS-LRR) method is proposed. The advantage of RSS-LRR over traditional LRR is that a cosine scaling factor is further introduced, which imposes a penalty on each sample to better minimize the influence of noise and outliers. Specifically, the cosine scaling factor is a similarity measure learned to capture each sample's relationship with the principal components of the low-rank representation in the feature space. In other words, the smaller the angle between an individual data sample and the low-rank representation's principal components, the more likely it is that the data sample is clean. Thus, the proposed method can effectively obtain a good low-rank representation influenced mainly by clean data. Several experiments are performed with varying levels of corruption on ORL, CMU PIE, COIL20, COIL100, and LFW in order to evaluate the effectiveness of RSS-LRR against state-of-the-art low-rank methods. The experimental results show that RSS-LRR consistently outperforms the compared methods in image clustering and classification tasks.
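The cosine scaling factor itself can be illustrated in isolation: project each sample onto the principal subspace of a low-rank approximation and take the cosine of the angle between the sample and that subspace, so that corrupted samples score lower. The sketch below uses a plain SVD on synthetic data and is not the recursive RSS-LRR procedure.

```python
# Sketch of the cosine-scaling idea only (not the full recursive RSS-LRR method):
# score each sample by the cosine of its angle with the principal subspace of a
# low-rank approximation; samples lying nearly inside the subspace score close to 1.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=(80, 2)) @ rng.normal(size=(2, 20))   # rank-2 clean data
noisy = clean.copy()
noisy[:5] += rng.normal(scale=5.0, size=(5, 20))              # corrupt a few samples

r = 2                                                          # assumed rank
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
P = Vt[:r]                                                     # top-r principal directions

proj = noisy @ P.T @ P                                         # projection onto the subspace
cos = np.linalg.norm(proj, axis=1) / (np.linalg.norm(noisy, axis=1) + 1e-12)
print("scores of corrupted samples:", np.round(cos[:5], 2))
print("mean score of clean samples:", round(cos[5:].mean(), 2))
```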


Author(s): Srinjoy Das, Hrushikesh N. Mhaskar, Alexander Cloninger

This paper introduces kdiff, a novel kernel-based measure for estimating distances between instances of time series, random fields, and other forms of structured data. The measure is based on the idea of matching distributions that only overlap over a portion of their region of support. It is inspired by MPdist, which has been previously proposed for such datasets and is constructed using Euclidean metrics, whereas kdiff is constructed using nonlinear kernel distances. In addition, kdiff accounts for both self- and cross-similarities across the instances and is defined using a lower quantile of the distance distribution. Comparing the cross-similarity to the self-similarity yields similarity measures that are more robust to noise and partial occlusions of the relevant signals. The proposed measure kdiff is a more general form of the well-known kernel-based maximum mean discrepancy (MMD) distance estimated over the embeddings. Some theoretical results are provided for separability conditions when using kdiff as a distance measure for clustering and classification problems where the embedding distributions can be modeled as two-component mixtures. Applications are demonstrated for clustering of synthetic and real-life time series and image data, and the performance of kdiff is compared with competing distance measures for clustering.
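The quantile idea can be illustrated with a toy stand-in: compute kernel-induced distances within and across two sets of samples and compare a low quantile of the cross distances with the same quantile of the self distances. The RBF kernel, bandwidth, and quantile below are arbitrary choices, and this is a conceptual sketch rather than the paper's kdiff estimator.

```python
# Toy illustration of the idea behind kdiff: compare a low quantile of the
# cross-instance kernel distances with the same quantile of the self distances.
import numpy as np

def rbf_dist(A, B, gamma=0.5):
    # kernel-induced distance: d(a, b)^2 = k(a,a) + k(b,b) - 2 k(a,b) = 2 - 2 k(a,b) for RBF
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = np.exp(-gamma * sq)
    return np.sqrt(np.maximum(2.0 - 2.0 * k, 0.0))

def kdiff_like(x, y, q=0.1):
    cross = rbf_dist(x, y).ravel()
    self_x = rbf_dist(x, x)[np.triu_indices(len(x), 1)]
    # small value => the instances partially overlap (cross distances comparable to self)
    return np.quantile(cross, q) - np.quantile(self_x, q)

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(50, 2))
b = rng.normal(0, 1, size=(50, 2))      # same distribution as a
c = rng.normal(4, 1, size=(50, 2))      # shifted distribution
print("same distribution :", round(kdiff_like(a, b), 3))
print("different distrib.:", round(kdiff_like(a, c), 3))
```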


Author(s): Souad Azzouzi, Amal Hjouji, Jaouad EL-Mekkaoui, Ahmed EL Khalfi

The fuzzy C-means (FCM) algorithm has been widely used in the field of clustering and classification but has encountered difficulties with noisy data and outliers. Other versions of the algorithm related to possibilistic theory have given good results, such as possibilistic C-means (PCM), fuzzy possibilistic C-means (FPCM), and the possibilistic fuzzy C-means algorithm (PFCM). This last algorithm works effectively in some environments but shows shortcomings on noisy databases. To solve this problem, we propose in this manuscript a new algorithm, named Improved Possibilistic Fuzzy C-Means (ImPFCM), by combining the PFCM algorithm with a very powerful statistical method. The properties of this new ImPFCM algorithm show that it is applicable not only to clusters of spherical shape, but also to clusters of different sizes and densities. The results of a comparative study with very recent algorithms indicate the performance and superiority of the proposed approach in grouping datasets in a large-dimensional space and in using not only the Euclidean distance but also more sophisticated norms capable of dealing with much more complicated problems. We have also demonstrated that the ImPFCM algorithm is capable of detecting cluster centers with high accuracy and of performing satisfactorily in multiple environments with noisy data and outliers.
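For reference, the baseline that these possibilistic variants extend is the standard FCM update loop, sketched below in NumPy; this is not the proposed ImPFCM algorithm, only the membership and centre updates it builds on.

```python
# Minimal standard fuzzy C-means (the baseline the possibilistic variants extend);
# this is not ImPFCM, just the classic membership/centre alternating updates.
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                     # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # weighted cluster centres
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)))
        U_new /= U_new.sum(axis=1, keepdims=True)         # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])
centers, U = fcm(X, c=2)
print("estimated centres:\n", np.round(centers, 2))
```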


Webology, 2021, Vol. 18 (Special Issue 05), pp. 1137-1157
Author(s): V. Vamsi Krishna, G. Gopinath

Automated functional tests are a long-standing issue in software development projects, yet they are still often carried out manually. The Selenium testing framework has gained popularity as an active community and a standard environment for the automated assessment of web applications. As web services evolve on a daily basis, there is a need to improve automated testing. This study makes the system learn from the experience of previous test cases and apply it to new cases to predict test case status, using a Tanh-activated Clustering and Classification model (TACC). The primary goal is to improve the model's clustering and classification output. The outcomes show that the TACC model increases performance and demonstrates that automated testing results can be predicted, which is cost effective and greatly reduces manual effort.
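The abstract does not detail the TACC architecture, so the following is only an illustrative stand-in: a tanh-activated classifier trained on hypothetical test-case features (duration, number of steps, and past failure rate are invented) to predict pass/fail status.

```python
# Illustrative only: a tanh-activated classifier predicting pass/fail for test
# cases from made-up execution features; not the TACC model from the paper.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 3))                          # [duration, num_steps, past_fail_rate]
y = (X[:, 2] + 0.3 * X[:, 0] > 0.7).astype(int)   # synthetic "will fail" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                    max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 3))
```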


2021
Author(s): Masahiro Kuroda

Mixture models have become increasingly popular due to their modeling flexibility and are applied to the clustering and classification of heterogeneous data. The EM algorithm is widely used for the maximum likelihood estimation of mixture models because it is stable in convergence and simple to implement. Despite these advantages, its main drawbacks are local convergence and slow convergence. To avoid local convergence, multiple runs from several different initial values are usually used; the algorithm may then take a large number of iterations and a long computation time to find the maximum likelihood estimates. Speeding up the computation of the EM algorithm addresses these problems. We present algorithms that accelerate the convergence of the EM algorithm and apply them to mixture model estimation. Numerical experiments examine the performance of the acceleration algorithms in terms of the number of iterations and computation time.
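The multiple-restart practice mentioned above is what standard toolkits expose directly; a minimal example with scikit-learn's GaussianMixture is shown below (the chapter's acceleration algorithms themselves are not reproduced).

```python
# EM for a Gaussian mixture, restarted from several initial values (n_init) to
# reduce the risk of local convergence; the acceleration schemes are not shown.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])

gm = GaussianMixture(n_components=2, n_init=10, max_iter=500, random_state=0).fit(X)
print("total log-likelihood:", round(gm.score(X) * len(X), 1))
print("EM iterations of the best run:", gm.n_iter_)
print("estimated means:\n", np.round(gm.means_, 2))
```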


2021, Vol. 8 (1)
Author(s): Rongbo Chen, Haojun Sun, Lifei Chen, Jianfei Zhang, Shengrui Wang

Markov models are extensively used for categorical sequence clustering and classification due to their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models are based on the implicit assumption that the probability of the next state depends on a preceding context/pattern consisting of consecutive states. This restriction hampers the models, since some patterns, disrupted by noise, may not be frequent enough in a consecutive form but frequent in a sparse form, so the information hidden in the sequential data cannot be fully exploited. A sparse pattern is a pattern in which one or more of the states between the first and the last are replaced by wildcards that can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach, making it capable of dealing with sparse patterns and of handling their length adaptively, i.e., allowing variable-length patterns with variable wildcards. The model, named the Dynamic order Markov model (DOMM), allows deriving a new similarity measure between a sequence and a set of sequences/cluster. DOMM builds a sparse pattern from sub-frequent patterns that contain significant statistical information veiled by the noise. To implement DOMM, we propose a sparse pattern detector (SPD) based on the probability suffix tree (PST), capable of discovering both sparse and consecutive patterns, and we then develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for categorical sequence clustering. Experimental results on real-world datasets demonstrate the promising performance of the proposed model.
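The notion of a sparse pattern can be made concrete with a small sketch: a pattern whose interior positions may be wildcards is counted in a categorical sequence, recovering occurrences that a purely consecutive pattern would miss. This illustrates only the wildcard idea, not the DOMM model or the PST-based detector.

```python
# Sketch of the "sparse pattern" notion only: a pattern whose middle positions may
# be wildcards ('*') matching any symbol. Counting such patterns recovers structure
# that noise breaks when only consecutive patterns are counted.
def count_pattern(seq, pattern, wildcard="*"):
    n, m = len(seq), len(pattern)
    hits = 0
    for i in range(n - m + 1):
        window = seq[i:i + m]
        if all(p == wildcard or p == s for p, s in zip(pattern, window)):
            hits += 1
    return hits

seq = list("ABXCABYCABZC")                # 'AB?C' occurs, but never the same way twice
print(count_pattern(seq, list("ABC")))    # 0 consecutive occurrences
print(count_pattern(seq, list("AB*C")))   # 3 sparse occurrences
```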

