A spectral-based clustering algorithm for categorical data using data summaries

Author(s):  
Eman Abdu ◽  
Douglas Salane
2018 ◽  
Vol 3 (1) ◽  
pp. 001
Author(s):  
Zulhendra Zulhendra ◽  
Gunadi Widi Nurcahyo ◽  
Julius Santony

This study applies data mining, specifically K-Means clustering, to service data. Data mining supports the analysis of fairly large data sets, and here it is used so that Indocomputer can understand and classify its service data based on customer complaints, using the Weka software. The K-Means clustering algorithm is used to predict or classify complaints about hardware damage at Indocomputer Payakumbuh. The analysis also identifies which laptop brands are most often brought in for service at Indocomputer Payakumbuh, which can serve as one recommendation to consumers choosing a laptop.


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 184
Author(s):  
Xia Que ◽  
Siyuan Jiang ◽  
Jiaoyun Yang ◽  
Ning An

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the different importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods, with accuracy improvements of 2.13% and 4.28%, respectively, on six mixed datasets from UCI.
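
The abstract above does not give the weighting formula, so the following is only a minimal sketch of one plausible entropy-based scheme: each categorical attribute is weighted by its share of the total Shannon entropy, and similarity is the summed weight of the matching attributes. The function names, the normalization, and the toy rows are assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_weights(rows):
    """Assumed scheme: weight each attribute by its share of total entropy."""
    columns = list(zip(*rows))
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]

def weighted_similarity(a, b, weights):
    """Sum the weights of the attributes on which two objects agree."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)

# Hypothetical toy data: two categorical attributes per object.
rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
weights = entropy_weights(rows)
similarity = weighted_similarity(rows[0], rows[2], weights)
```

In this toy data the second attribute is evenly split and so receives the larger weight; the two compared objects agree only on the first attribute, so their similarity equals the first weight.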


2020 ◽  
Vol 500 (1) ◽  
pp. 1323-1339
Author(s):  
Ciria Lima-Dias ◽  
Antonela Monachesi ◽  
Sergio Torres-Flores ◽  
Arianna Cortesi ◽  
Daniel Hernández-Lang ◽  
...  

ABSTRACT The nearby Hydra cluster (∼50 Mpc) is an ideal laboratory to understand, in detail, the influence of the environment on the morphology and quenching of galaxies in dense environments. We study the Hydra cluster galaxies in the inner regions (1R200) of the cluster using data from the Southern Photometric Local Universe Survey, which uses 12 narrow and broad-band filters in the visible region of the spectrum. We analyse structural (Sérsic index, effective radius) and physical (colours, stellar masses, and star formation rates) properties. Based on this analysis, we find that ∼88 per cent of the Hydra cluster galaxies are quenched. Using the Dressler–Schectman test approach, we also find that the cluster shows possible substructures. Our analysis of the phase-space diagram together with density-based spatial clustering algorithm indicates that Hydra shows an additional substructure that appears to be in front of the cluster centre, which is still falling into it. Our results, thus, suggest that the Hydra cluster might not be relaxed. We analyse the median Sérsic index as a function of wavelength and find that for red [(u − r) ≥2.3] and early-type galaxies it displays a slight increase towards redder filters (13 and 18 per cent, for red and early type, respectively), whereas for blue + green [(u − r)<2.3] galaxies it remains constant. Late-type galaxies show a small decrease of the median Sérsic index towards redder filters. Also, the Sérsic index of galaxies, and thus their structural properties, do not significantly vary as a function of clustercentric distance and density within the cluster; and this is the case regardless of the filter.


2021 ◽  
Author(s):  
Xin Sui ◽  
Wanjing Wang ◽  
Jinfeng Zhang

In this work, we trained an ensemble model for predicting drug-protein interactions within a sentence based only on its semantics. Our ensemble model was built from three separate models: 1) a classification model using a fine-tuned BERT model; 2) a fine-tuned sentence BERT model that embeds every sentence into a vector; and 3) another classification model using a fine-tuned T5 model. In all models, we further improved performance using data augmentation. For model 2, we predicted the label of a sentence by applying k-nearest neighbors to its embedded vector. We also explored two ways to combine these three models: a) a majority vote over their predictions; and b) a second ensemble model, based on the HDBSCAN clustering algorithm, trained on features from all the models. Our best model achieved an F1 score of 0.753 on the BioCreative VII Track 1 test dataset.
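
The majority-vote option mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming each model outputs one label per sentence; the label names and the tie-breaking behavior are assumptions, not details from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the models' predictions for one sentence.

    On a tie, Counter.most_common keeps first-seen order, so the label
    predicted by the earliest model in the list wins (an assumed rule).
    """
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_model_labels):
    """Combine per-model label lists (one list per model) sentence by sentence."""
    return [majority_vote(labels) for labels in zip(*per_model_labels)]

# Hypothetical predictions from three models on three sentences.
model_outputs = [
    ["A", "B", "A"],  # model 1
    ["A", "A", "B"],  # model 2
    ["B", "A", "B"],  # model 3
]
combined = ensemble_predict(model_outputs)  # ["A", "A", "B"]
```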


2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al. ◽  

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome this shortcoming, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


2021 ◽  
Author(s):  
Pasi Fränti ◽  
Sami Sieranoja ◽  
Katja Wikström ◽  
Tiina Laatikainen

BACKGROUND: Patients with multiple chronic diseases cause a major burden to the health service system. Currently, diseases are mostly treated separately without paying enough attention to their relationships, which results in a fragmentation of the care process. Better integration of services can lead to more effective organization of the overall health care system.
OBJECTIVE: To analyze the connections between diseases based on their co-occurrences in order to support decision-makers in better organizing health care services.
METHODS: We performed cluster analysis of diagnoses using data from the Finnish Health Care Registers for primary and specialized health care visits and inpatient care. The target population of this study comprised all individuals aged 18 years or older who used health care services during the years 2015–2018. Clustering was performed based on the co-occurrence of diagnoses. The more often the same pair of diagnoses appears in the records of the same patients, the more the diagnoses correlate. Based on the co-occurrences, we calculated the relative risk of each pair of diagnoses and clustered the data using a graph-based clustering algorithm called M-algorithm, a variant of k-means.
RESULTS: The results reveal multimorbidity clusters, of which some are expected, for example one representing hypertensive and cardiovascular diseases. Other clusters are more unexpected, such as a cluster containing lower respiratory tract diseases and systemic connective tissue disorders. We also report the estimated cost effect of each cluster to society.
CONCLUSIONS: The method and achieved results provide new insight to identify key multimorbidity groups, especially ones resulting in burden and costs in health care services.
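
The abstract above does not state how the relative risk is computed. As a hedged sketch, the code below assumes a common observed-over-expected definition, RR(a, b) = C_ab * N / (P_a * P_b), where C_ab is the number of patients with both diagnoses, P_a and P_b are the numbers of patients with each diagnosis, and N is the number of patients. The function name and the diagnosis codes are illustrative only.

```python
from collections import Counter
from itertools import combinations

def relative_risks(patient_diagnoses):
    """Relative risk for every co-occurring diagnosis pair (assumed definition).

    patient_diagnoses: one collection of diagnosis codes per patient.
    Returns {(code_a, code_b): RR} with code_a < code_b.
    """
    n = len(patient_diagnoses)
    single = Counter()  # number of patients with each diagnosis
    pair = Counter()    # number of patients with both diagnoses of a pair
    for codes in patient_diagnoses:
        unique = sorted(set(codes))  # count each diagnosis once per patient
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): (count * n) / (single[a] * single[b])
        for (a, b), count in pair.items()
    }

# Hypothetical records for four patients.
patients = [{"I10", "E11"}, {"I10", "E11"}, {"I10"}, {"J45"}]
rr = relative_risks(patients)  # {("E11", "I10"): 1.333...}
```

A value above 1 means the pair appears together more often than independent occurrence would predict; such pairwise values could then serve as edge weights for a graph-based clustering step.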


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Ziqi Jia ◽  
Ling Song

The k-prototypes algorithm is a hybrid clustering algorithm that can process both categorical and numerical data. In this study, the method for selecting initial cluster centers was improved and a new hybrid dissimilarity coefficient was proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm (WKPCA) was proposed. WKPCA not only improves the selection of initial cluster centers, but also puts forward a new method for calculating the dissimilarity between data objects and cluster centers. Real datasets from UCI were used to test the WKPCA algorithm. Experimental results show that WKPCA is more efficient and robust than other k-prototypes algorithms.
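
For context, the baseline k-prototypes dissimilarity (not the hybrid coefficient proposed in the abstract above, whose form is not given here) is usually written as squared Euclidean distance on the numerical attributes plus a weighted count of categorical mismatches. The sketch below assumes that standard form; the function name, the weight `gamma`, and the sample values are illustrative.

```python
def kprototypes_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Baseline mixed dissimilarity between an object and a cluster prototype.

    num_a, num_b: numerical attribute values.
    cat_a, cat_b: categorical attribute values.
    gamma: weight balancing the categorical part against the numerical part.
    """
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric_part + gamma * categorical_part

# Hypothetical object and prototype: two numerical and two categorical attributes.
d = kprototypes_dissimilarity(
    [1.0, 2.0], ["red", "small"],
    [2.0, 4.0], ["red", "large"],
    gamma=0.5,
)  # (1 + 4) + 0.5 * 1 = 5.5
```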


2013 ◽  
Vol 215 ◽  
pp. 55-73 ◽  
Author(s):  
Liang Bai ◽  
Jiye Liang ◽  
Chuangyin Dang ◽  
Fuyuan Cao
