A spectral-based clustering algorithm for categorical data using data summaries

This study applies data mining, specifically K-Means clustering, using the Weka software. Data mining supports the analysis of large volumes of data, and here it enables Indocomputer to understand and classify its service data based on customer complaints. The K-Means clustering algorithm is used to classify complaints about hardware damage at Indocomputer Payakumbuh. The analysis also identifies the laptop brands most often brought in for service at Indocomputer Payakumbuh, which can serve as one recommendation to consumers when choosing a laptop.

A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Algorithms ◽

10.3390/a14060184 ◽

2021 ◽

Vol 14 (6) ◽

pp. 184

Author(s):

Xia Que ◽

Siyuan Jiang ◽

Jiaoyun Yang ◽

Ning An

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Numerical Data ◽

Similarity Measurement ◽

Amount Of Information ◽

Automatic Categorization ◽

Categorical Attributes ◽

Weighting Strategy

Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat all attributes equally when measuring similarity. However, different attributes may contribute differently, because the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to reflect the differing importance of the attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods by 2.13% and 4.28%, respectively, in terms of accuracy on six mixed datasets from UCI.
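
The idea of entropy-based attribute weighting can be illustrated with a minimal sketch. It is not the paper's exact formulation: the function names are hypothetical, the weights are assumed to be each attribute's Shannon entropy normalized to sum to 1, and the similarity is assumed to be a weighted simple-matching score over already-categorized attributes.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_weights(columns):
    """Assumed weighting: each attribute's entropy divided by the total entropy."""
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant, so fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]

def weighted_similarity(a, b, weights):
    """Sum of the weights of the attributes on which two records agree."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)
```

Under this assumption, an attribute that takes the same value for every record has zero entropy and therefore contributes nothing to the similarity.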

Performance evaluation of Distance based Angular Clustering Algorithm (DACA) using data aggregation for heterogeneous WSN

2016 International Conference on Computation of Power, Energy Information and Commuincation (ICCPEIC) ◽

10.1109/iccpeic.2016.7557231 ◽

2016 ◽

Author(s):

Navjot Kumar ◽

Surinder Kaur

Keyword(s):

Performance Evaluation ◽

Data Aggregation ◽

Clustering Algorithm ◽

Using Data

An environmental dependence of the physical and structural properties in the Hydra cluster galaxies

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3326 ◽

2020 ◽

Vol 500 (1) ◽

pp. 1323-1339

Author(s):

Ciria Lima-Dias ◽

Antonela Monachesi ◽

Sergio Torres-Flores ◽

Arianna Cortesi ◽

Daniel Hernández-Lang ◽

...

Keyword(s):

Structural Properties ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Broad Band ◽

Visible Region ◽

Early Type ◽

Cluster Galaxies ◽

Using Data ◽

Stellar Masses ◽

Environmental Dependence

ABSTRACT The nearby Hydra cluster (∼50 Mpc) is an ideal laboratory to understand, in detail, the influence of the environment on the morphology and quenching of galaxies in dense environments. We study the Hydra cluster galaxies in the inner regions (1R200) of the cluster using data from the Southern Photometric Local Universe Survey, which uses 12 narrow and broad-band filters in the visible region of the spectrum. We analyse structural (Sérsic index, effective radius) and physical (colours, stellar masses, and star formation rates) properties. Based on this analysis, we find that ∼88 per cent of the Hydra cluster galaxies are quenched. Using the Dressler–Schectman test approach, we also find that the cluster shows possible substructures. Our analysis of the phase-space diagram together with density-based spatial clustering algorithm indicates that Hydra shows an additional substructure that appears to be in front of the cluster centre, which is still falling into it. Our results, thus, suggest that the Hydra cluster might not be relaxed. We analyse the median Sérsic index as a function of wavelength and find that for red [(u − r) ≥2.3] and early-type galaxies it displays a slight increase towards redder filters (13 and 18 per cent, for red and early type, respectively), whereas for blue + green [(u − r)<2.3] galaxies it remains constant. Late-type galaxies show a small decrease of the median Sérsic index towards redder filters. Also, the Sérsic index of galaxies, and thus their structural properties, do not significantly vary as a function of clustercentric distance and density within the cluster; and this is the case regardless of the filter.

Text Mining Drug-Protein Interactions using an Ensemble of BERT, Sentence BERT and T5 models

10.1101/2021.10.26.465944 ◽

2021 ◽

Author(s):

Xin Sui ◽

Wanjing Wang ◽

Jinfeng Zhang

Keyword(s):

Protein Interactions ◽

Clustering Algorithm ◽

Data Augmentation ◽

Majority Vote ◽

Classification Model ◽

Ensemble Model ◽

K Nearest Neighbors ◽

Test Dataset ◽

Improved Performance ◽

Using Data

In this work, we trained an ensemble model for predicting drug-protein interactions within a sentence based only on its semantics. Our ensemble model was built from three separate models: 1) a classification model using a fine-tuned BERT model; 2) a fine-tuned sentence BERT model that embeds every sentence into a vector; and 3) another classification model using a fine-tuned T5 model. In all models, we further improved performance using data augmentation. For model 2, we predicted the label of a sentence using k-nearest neighbors on its embedded vector. We also explored two ways to ensemble these three models: a) a majority vote over their predictions; and b) a further ensemble model, based on the HDBSCAN clustering algorithm, trained on features from all the models to make decisions. Our best model achieved an F-1 score of 0.753 on the BioCreative VII Track 1 test dataset.
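
The majority-vote ensemble and the k-nearest-neighbors labeling step can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and the tie-breaking rule are assumptions, the labels in the usage note are placeholders, and squared Euclidean distance stands in for whatever distance the authors applied to the sentence embeddings.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest prediction."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def knn_label(query, examples, k=3):
    """Label a query vector by majority vote among its k nearest examples.

    `examples` is a list of (vector, label) pairs.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    nearest = sorted(examples, key=lambda ex: sq_dist(query, ex[0]))[:k]
    return majority_vote([label for _, label in nearest])
```

For example, `majority_vote(["interaction", "none", "interaction"])` returns `"interaction"`, since two of the three hypothetical model outputs agree.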

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al.

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the attribute with fewer values and the leaf node with more objects, leading to undesirable clustering results. To overcome these shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Clustering Diagnoses from 58M Patient Visits in Finland 2015–2018 (Preprint)

10.2196/preprints.35422 ◽

2021 ◽

Author(s):

Pasi Fränti ◽

Sami Sieranoja ◽

Katja Wikström ◽

Tiina Laatikainen

Keyword(s):

Health Care ◽

Health Care Services ◽

Clustering Algorithm ◽

Target Population ◽

Care Process ◽

Care Services ◽

System P ◽

Integration Of Services ◽

Effective Organization ◽

Using Data

BACKGROUND Patients with multiple chronic diseases place a major burden on the health service system. Currently, diseases are mostly treated separately without paying enough attention to their relationships, which results in a fragmentation of the care process. Better integration of services can lead to more effective organization of the overall health care system. OBJECTIVE To analyze the connections between diseases based on their co-occurrences in order to support decision-makers in better organizing health care services. METHODS We performed cluster analysis of diagnoses using data from the Finnish Health Care Registers for primary and specialized health care visits and inpatient care. The target population of this study comprised all individuals aged 18 years or older who used health care services during the years 2015–2018. Clustering was performed based on the co-occurrence of diagnoses. The more often the same pair of diagnoses appears in the records of the same patients, the more strongly the diagnoses correlate. Based on the co-occurrences, we calculated the relative risk of each pair of diagnoses and clustered the data using a graph-based clustering algorithm called M-algorithm, a variant of k-means. RESULTS The results reveal multimorbidity clusters, some of which are expected, for example one representing hypertensive and cardiovascular diseases. Other clusters are more unexpected, such as a cluster containing lower respiratory tract diseases and systemic connective tissue disorders. We also report the estimated cost effect of each cluster on society. CONCLUSIONS The method and the results achieved provide new insight for identifying key multimorbidity groups, especially those resulting in burden and costs in health care services.
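
The abstract does not give the relative-risk formula, so the sketch below assumes one definition that is common in comorbidity analysis: RR(i, j) = C_ij · N / (P_i · P_j), where C_ij is the number of patients with both diagnoses, P_i and P_j are the numbers of patients with each diagnosis, and N is the total number of patients. The function name and the diagnosis codes in the usage note are illustrative only.

```python
from collections import Counter
from itertools import combinations

def relative_risks(patient_diagnoses):
    """Relative risk for every pair of diagnoses that co-occurs at least once.

    `patient_diagnoses` holds one collection of diagnosis codes per patient.
    Returns a dict keyed by alphabetically ordered (code_a, code_b) pairs.
    """
    n = len(patient_diagnoses)
    single = Counter()  # patients per diagnosis
    pair = Counter()    # patients per pair of diagnoses
    for diagnoses in patient_diagnoses:
        codes = sorted(set(diagnoses))  # count each code once per patient
        single.update(codes)
        pair.update(combinations(codes, 2))
    return {
        (a, b): (count * n) / (single[a] * single[b])
        for (a, b), count in pair.items()
    }
```

With four patients whose diagnoses are `["I10", "E11"]`, `["I10", "E11"]`, `["I10"]` and `["J45"]`, the pair `("E11", "I10")` gets 2 · 4 / (2 · 3) ≈ 1.33. A value above 1 means the pair co-occurs more often than independence would predict. Such pairwise values could then serve as edge weights for a graph-based clustering step.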

School-Based Management Performance Efficiency Modeling and Profiling using Data Envelopment Analysis and K-Means Clustering Algorithm

2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS) ◽

10.1109/ccoms.2019.8821732 ◽

2019 ◽

Author(s):

Jona P. Tibay ◽

Shaneth C. Ambat ◽

Ace C. Lagman

Keyword(s):

Data Envelopment Analysis ◽

Clustering Algorithm ◽

Data Envelopment ◽

Management Performance ◽

School Based ◽

School Based Management ◽

Using Data

Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient

Mathematical Problems in Engineering ◽

10.1155/2020/5143797 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Ziqi Jia ◽

Ling Song

Keyword(s):

Categorical Data ◽

Clustering Algorithm ◽

Numerical Data ◽

Experimental Results ◽

Cluster Center ◽

Real Dataset ◽

Dissimilarity Coefficient ◽

Initial Cluster ◽

Data Objects ◽

Selection Of

The k-prototypes algorithm is a hybrid clustering algorithm that can process both categorical and numerical data. In this study, the method of initial cluster center selection was improved and a new hybrid dissimilarity coefficient was proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm (WKPCA) was proposed. The WKPCA algorithm not only improves the selection of initial cluster centers but also puts forward a new method for calculating the dissimilarity between data objects and cluster centers. Real data from UCI were used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.
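
For context, the dissimilarity used by the classic k-prototypes algorithm (not the hybrid coefficient proposed for WKPCA, which the abstract does not specify) adds the squared Euclidean distance over numerical attributes to a weighted count of mismatches over categorical attributes. The sketch below is a minimal illustration. The function names, the index-list interface, and the default `gamma` are assumptions.

```python
def kprototypes_dissimilarity(obj, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Classic k-prototypes dissimilarity between a data object and a prototype.

    Squared Euclidean distance on the numerical attributes plus gamma times
    the number of mismatching categorical attributes.
    """
    numeric_part = sum((obj[i] - prototype[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(1 for i in categorical_idx if obj[i] != prototype[i])
    return numeric_part + gamma * categorical_part

def nearest_prototype(obj, prototypes, numeric_idx, categorical_idx, gamma=1.0):
    """Index of the prototype (cluster center) closest to the object."""
    return min(
        range(len(prototypes)),
        key=lambda k: kprototypes_dissimilarity(
            obj, prototypes[k], numeric_idx, categorical_idx, gamma
        ),
    )
```

The parameter `gamma` balances the two parts. A larger value makes categorical mismatches count more relative to numerical distance.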
