Learning representations from dendrograms

We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with the single linkage criterion and defining specific forms of a level function and a distance function over it. We therefore extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances so that many numerical machine learning algorithms can employ them. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances in solution space and in representation space, respectively, in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially in the spirit of multi-layered architectures to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
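
To make the Minimax distance mentioned in this abstract concrete: the Minimax distance between two objects is the smallest achievable value of the largest step along any path connecting them, which is also the level at which they merge under single linkage. The following is a minimal sketch, not the authors' implementation; the function name `minimax_distances` and the toy data are assumptions made for illustration, and the cubic-time relaxation is chosen for clarity rather than efficiency.

```python
def minimax_distances(d):
    """Return minimax path distances for a symmetric distance matrix d (list of lists)."""
    n = len(d)
    m = [row[:] for row in d]  # work on a copy so the input stays untouched
    # Floyd-Warshall-style relaxation over the (min, max) semiring:
    # a path routed through k costs the larger of its two legs.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(m[i][k], m[k][j])
                if via_k < m[i][j]:
                    m[i][j] = via_k
    return m


# Toy data (hypothetical): four points on a line at 0, 1, 2 and 10.
points = [0.0, 1.0, 2.0, 10.0]
base = [[abs(a - b) for b in points] for a in points]
mm = minimax_distances(base)

print(mm[0][2])  # largest step on the path 0 -> 1 -> 2
print(mm[0][3])  # the gap between 2 and 10 dominates every path to the last point
```

In this toy example the three nearby points end up mutually close, while the outlying point keeps a large distance to all of them, which is the behaviour single linkage would show in its dendrogram.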


Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data

Applied Sciences ◽ 10.3390/app11188416 ◽ 2021 ◽ Vol 11 (18) ◽ pp. 8416

Author(s): Changki Lee ◽ Uk Jung

Keyword(s): Learning Outcomes ◽ Categorical Data ◽ Dissimilarity Measure ◽ Machine Learning Algorithms ◽ Distance Measures ◽ Categorical Variables ◽ Continuous Data ◽ Clustering Problem ◽ Data Clusters ◽ Categorical Data Clustering

Measuring the dissimilarity between two observations is the basis of many data mining and machine learning algorithms, and its effectiveness has a significant impact on learning outcomes. Computing a dissimilarity or distance is manageable for continuous data because many numerical operations can be applied directly. However, unlike continuous data, defining a dissimilarity between pairs of observations with categorical variables is not straightforward. This study proposes a new method to measure the dissimilarity between two categorical observations, called a context-based geodesic dissimilarity measure, for the categorical data clustering problem. The proposed method considers the relationships between categorical variables and discovers the implicit topological structures in categorical data. In other words, it can effectively reflect the nonlinear patterns of arbitrarily shaped categorical data clusters. Our experimental results confirm that the proposed measure, which considers both nonlinear data patterns and relationships among the categorical variables, yields better clustering performance than other distance measures.


Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements

ISPRS International Journal of Geo-Information ◽ 10.3390/ijgi10010042 ◽ 2021 ◽ Vol 10 (1) ◽ pp. 42

Author(s): Kieu Anh Nguyen ◽ Walter Chen ◽ Bor-Shiun Lin ◽ Uma Seeboonruang

Keyword(s): Machine Learning ◽ Soil Erosion ◽ Ensemble Methods ◽ Machine Learning Algorithms ◽ Multivariate Adaptive Regression Splines ◽ Gradient Boosting ◽ Support Vector ◽ Ensemble Machine Learning ◽ Boosting Method ◽ Bagging Method

Although machine learning has been extensively used in various fields, it has only recently been applied to soil erosion pin modeling. To improve upon previous methods of quantifying soil erosion based on erosion pin measurements, this study explored the possible application of ensemble machine learning algorithms to the Shihmen Reservoir watershed in northern Taiwan. Three categories of ensemble methods were considered in this study: (a) bagging, (b) boosting, and (c) stacking. The bagging method in this study refers to bagged multivariate adaptive regression splines (bagged MARS) and random forest (RF), and the boosting method includes Cubist and gradient boosting machine (GBM). Finally, the stacking method is an ensemble method that uses a meta-model to combine the predictions of base models. This study used RF and GBM as the meta-models, and decision tree, linear regression, artificial neural network, and support vector machine as the base models. The dataset used in this study was sampled using stratified random sampling to achieve a 70/30 split for the training and test data, and the process was repeated three times. The performance of six ensemble methods in three categories was analyzed based on the average of three attempts. It was found that GBM performed the best among the ensemble models with the lowest root-mean-square error (RMSE = 1.72 mm/year), the highest Nash-Sutcliffe efficiency (NSE = 0.54), and the highest index of agreement (d = 0.81). This result was confirmed by the spatial comparison of the absolute differences (errors) between model predictions and observations using GBM and RF in the study area. In summary, the results show that as a group, the bagging method and the boosting method performed equally well, and the stacking method was third for the erosion pin dataset considered in this study.
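
The three scores reported in this abstract have standard closed forms. The following is a minimal sketch, assuming the index of agreement is Willmott's original d (the abstract does not state which variant was used); the function names and the toy observations are hypothetical and are not the study's data.

```python
import math


def rmse(obs, pred):
    """Root-mean-square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nash_sutcliffe(obs, pred):
    """NSE = 1 - SSE / variance of observations about their mean (1 is a perfect fit)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / spread


def index_of_agreement(obs, pred):
    """Willmott's d = 1 - SSE / sum((|p - mean_obs| + |o - mean_obs|)^2) (assumed variant)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    potential = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                    for o, p in zip(obs, pred))
    return 1.0 - sse / potential


# Toy erosion depths in mm/year (hypothetical values).
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]  # always predicts the observed mean

print(rmse(observed, perfect), nash_sutcliffe(observed, perfect))
print(nash_sutcliffe(observed, mean_only), index_of_agreement(observed, mean_only))
```

A model that always predicts the observed mean scores NSE = 0, which gives a reference point for reading the reported NSE of 0.54.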


Calibration-Based Estimators using Different Distance Measures under Two Auxiliary Variables: A Comparative Study

Journal of Modern Applied Statistical Methods ◽ 10.22237/jmasm/1619481600 ◽ 2021 ◽ Vol 19 (1) ◽ pp. 2-20

Author(s): Piyush Kant Rai ◽ Alka Singh ◽ Muhammad Qasim

Keyword(s): Mean Squared Error ◽ Real Life ◽ Distance Functions ◽ Distance Measures ◽ Auxiliary Variables ◽ Data Set ◽ Life Data ◽ Squared Error ◽ Real Life Data ◽ Relative Root

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented, and the calibrated weights based on different distance functions are derived. A simulation study has been carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to confirm the superiority of the proposed method.


Ensemble Methods for APS In-Flight Particle Temperature and Velocity Prediction Considering Torch Electrodes Ageing

Thermal Spray 2021: Proceedings from the International Thermal Spray Conference ◽ 10.31399/asm.cp.itsc2021p0044 ◽ 2021

Author(s): K.R. Yu ◽ C.V. Cojocaru ◽ F. Ilinca ◽ E. Irissou

Keyword(s): Powder Particle ◽ Process Parameters ◽ Particle Temperature ◽ Ensemble Methods ◽ Machine Learning Algorithms ◽ Atmospheric Plasma ◽ Gradient Boosting ◽ Input Process ◽ Process Data ◽ Particle Characteristics

In an atmospheric plasma spray (APS) process, in-flight powder particle characteristics, such as the particle velocity and temperature, have significant influence on the coating formation. The nonlinear relationship between the input process parameters and in-flight particle characteristics is thus of paramount importance for coating properties design and quality control. It is also known that the ageing of torch electrodes affects this relationship. In recent years, machine learning algorithms have proven to be able to take into account such complex nonlinear interactions. This work illustrates the application of ensemble methods based on decision tree algorithms to evaluate and to predict in-flight particle temperature and velocity during an APS process considering torch electrodes ageing. Experiments were performed to record simultaneously the input process parameters, the in-flight powder particle characteristics and the electrodes usage time. Various spray durations were considered to emulate industrial coating spray production settings. Random forest and gradient boosting algorithms were used to rank and select the features for the APS process data recorded as the electrodes aged and the corresponding predictive models were compared. The time series aspect of the data will be examined.


Approximation algorithms for two variants of correlation clustering problem

Journal of Combinatorial Optimization ◽ 10.1007/s10878-020-00612-1 ◽ 2020

Author(s): Sai Ji ◽ Dachuan Xu ◽ Min Li ◽ Yishui Wang

Keyword(s): Approximation Algorithms ◽ Correlation Clustering ◽ Clustering Problem


Comparing distance measures on assessed medical device incident data using Average Silhouette Width

Current Directions in Biomedical Engineering ◽ 10.1515/cdbme-2018-0126 ◽ 2018 ◽ Vol 4 (1) ◽ pp. 525-528

Author(s): Christian Bayer ◽ Robin Seidel

Keyword(s): Distance Measure ◽ Data Preprocessing ◽ Study Data ◽ Machine Learning Algorithms ◽ Distance Measures ◽ Free Text ◽ Silhouette Width ◽ Federal Institute ◽ Cluster Density ◽ Incident Reports

Many machine learning algorithms depend on the choice of an appropriate similarity or distance measure. Comparing such measures in different domains and on diversely structured data is common, but it is often done with regard to an algorithm that clusters or classifies the data. In this study, data assessed by experts is analyzed instead. The data is taken from the database of the Federal Institute for Drugs and Medical Devices (BfArM) and represents free text incident reports. The Average Silhouette Width, a cluster density measure, is used to compare the distance measures’ ability to discriminate the data according to the experts’ assessments. The Euclidean distance and four distance measures derived from the Jaccard similarity, the Simple Matching similarity, the Cosine similarity and the Yule similarity are compared on four subsets of this database. The results show that better data preprocessing is necessary, possibly due to boilerplate texts being used to write incident reports. These results will also provide the basis to compare improvements by different methods of data preprocessing in the future.
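
The evaluation idea in this abstract, scoring a distance measure by how well it separates expert-assigned groups, can be sketched as follows. This is an illustrative sketch only: the token sets and labels are invented stand-ins for preprocessed incident reports, the function names are assumptions, and only the Jaccard-derived distance is shown. The silhouette of an item is (b - a) / max(a, b), where a is its mean distance to its own group and b is its mean distance to the nearest other group.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def average_silhouette_width(items, labels, dist):
    """Mean silhouette over all items, using the given labels as the grouping."""
    n = len(items)
    total = 0.0
    for i in range(n):
        # Collect distances from item i to every other item, grouped by label.
        by_group = {}
        for j in range(n):
            if j != i:
                by_group.setdefault(labels[j], []).append(dist(items[i], items[j]))
        own = by_group.pop(labels[i], None)
        if own is None or not by_group:
            continue  # singleton group or a single group overall: silhouette is 0
        a = sum(own) / len(own)
        b = min(sum(v) / len(v) for v in by_group.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Hypothetical token sets standing in for preprocessed incident reports.
reports = [
    {"pump", "alarm"},
    {"pump", "alarm", "battery"},
    {"catheter", "leak"},
    {"catheter", "leak", "tube"},
]
expert_labels = [0, 0, 1, 1]

asw = average_silhouette_width(reports, expert_labels, jaccard_distance)
print(round(asw, 3))
```

Swapping `jaccard_distance` for another distance function with the same two-argument signature allows the same labels to be scored under a different measure.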


On the hardness of labeled correlation clustering problem: A parameterized complexity view

Theoretical Computer Science ◽ 10.1016/j.tcs.2015.03.021 ◽ 2016 ◽ Vol 609 ◽ pp. 583-593

Author(s): Xianmin Liu ◽ Jianzhong Li ◽ Hong Gao

Keyword(s): Parameterized Complexity ◽ Correlation Clustering ◽ Clustering Problem


An FPT Algorithm for the Correlation Clustering Problem

Key Engineering Materials ◽ 10.4028/www.scientific.net/kem.474-476.924 ◽ 2011 ◽ Vol 474-476 ◽ pp. 924-927 ◽ Cited By ~ 1

Author(s): Xiao Xin

Keyword(s): Approximation Algorithms ◽ Polynomial Time ◽ Undirected Graph ◽ Fixed Parameter Tractable ◽ Correlation Clustering ◽ Time Approximation ◽ Running Time ◽ Clustering Problem ◽ Fpt Algorithm ◽ Fixed Parameter

Given an undirected graph G=(V, E) with real nonnegative weights and + or – labels on its edges, the correlation clustering problem is to partition the vertices of G into clusters to minimize the total weight of cut + edges and uncut – edges. This problem is APX-hard and has been intensively studied mainly from the viewpoint of polynomial time approximation algorithms. By way of contrast, a fixed-parameter tractable algorithm is presented that takes treewidth as the parameter, with a running time that is linear in the number of vertices of G.
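
The objective stated in this abstract, the total weight of cut + edges and uncut – edges, can be evaluated for any candidate partition with a few lines of code. The following is a minimal sketch of the objective only, not of the FPT algorithm; the edge tuple layout `(u, v, weight, label)`, the function name, and the toy graph are assumptions made for illustration.

```python
def correlation_clustering_cost(edges, cluster_of):
    """Sum the weights of '+' edges that are cut and '-' edges that are not cut."""
    cost = 0.0
    for u, v, weight, label in edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if label == "+" and not same_cluster:
            cost += weight  # similar vertices were separated
        elif label == "-" and same_cluster:
            cost += weight  # dissimilar vertices were kept together
    return cost


# Toy graph (hypothetical): a and b agree, b and c agree, a and c disagree.
edges = [
    ("a", "b", 2.0, "+"),
    ("b", "c", 1.0, "+"),
    ("a", "c", 3.0, "-"),
]

all_together = {"a": 0, "b": 0, "c": 0}
c_split_off = {"a": 0, "b": 0, "c": 1}

print(correlation_clustering_cost(edges, all_together))  # pays for the uncut '-' edge
print(correlation_clustering_cost(edges, c_split_off))   # pays for the cut '+' edge b-c
```

In this toy graph no partition has zero cost, because the labels are inconsistent around the triangle; the problem is to find the partition with the smallest such cost.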


Convolutional neural network-based ensemble methods to recognize Bangla handwritten character

PeerJ Computer Science ◽ 10.7717/peerj-cs.565 ◽ 2021 ◽ Vol 7 ◽ pp. e565

Author(s): Mir Moynuddin Ahmed Shibly ◽ Tahmina Akter Tisha ◽ Tanzina Akter Tani ◽ Shamim Ripon

Keyword(s): Character Recognition ◽ Autonomous System ◽ Large Scale ◽ Ensemble Methods ◽ Office Automation ◽ Machine Learning Algorithms ◽ Handwritten Character ◽ Handwritten Text ◽ Feature Extractor ◽ Handwritten Recognition

In this era of advancements in deep learning, an autonomous system that recognizes handwritten characters and texts can eventually be integrated with software to provide a better user experience. As with other languages, Bangla handwritten text extraction has various applications such as post-office automation, signboard recognition, and many more. A large-scale and efficient isolated Bangla handwritten character classifier can be the first building block of such a system. This study aims to classify handwritten Bangla characters. The proposed methods are divided into three phases. In the first phase, seven convolutional neural network (CNN)-based architectures are created. After that, the best performing CNN model is identified and used as a feature extractor. Classifiers are then obtained by using shallow machine learning algorithms. In the last phase, five ensemble methods are used to achieve better performance in the classification task. To systematically assess the outcomes of this study, a comparative analysis of the performances has also been carried out. Among all the methods, the stacked generalization ensemble method has achieved better performance than the other implemented methods. It has obtained accuracy, precision, and recall of 98.68%, 98.69%, and 98.68%, respectively, on the Ekush dataset. Moreover, the use of CNN architectures and ensemble methods in large-scale Bangla handwritten character recognition has also been justified by obtaining consistent results on the BanglaLekha-Isolated dataset. Such efficient systems can move handwritten recognition to the next level so that the processing of handwriting can easily be automated.


A Hybrid Meta-Learner Technique for Credit Scoring of Banks’ Customers

Engineering, Technology & Applied Science Research ◽ 10.48084/etasr.1361 ◽ 2017 ◽ Vol 7 (5) ◽ pp. 2073-2082 ◽ Cited By ~ 1

Author(s): A. G. Armaki ◽ M. F. Fallah ◽ M. Alborzi ◽ A. Mohammadzadeh

Keyword(s): Machine Learning ◽ Hybrid Model ◽ Credit Scoring ◽ Clustering Algorithms ◽ Real Life ◽ Ensemble Methods ◽ Scoring Systems ◽ Error Rates ◽ Machine Learning Algorithms ◽ Machine Learning Techniques

Financial institutions are exposed to credit risk due to the issuance of consumer loans. Thus, developing reliable credit scoring systems is crucial for them. Since machine learning techniques have demonstrated their applicability and merit, they have been extensively used in the credit scoring literature. Recent studies concentrating on hybrid models that merge various machine learning algorithms have revealed compelling results. There are two types of hybridization methods, namely traditional and ensemble methods. This study combines both of them and comes up with a hybrid meta-learner model. The structure of the model is based on the traditional hybrid model of ‘classification + clustering’ in which the stacking ensemble method is employed in the classification part. Moreover, this paper compares several versions of the proposed hybrid model by using various combinations of classification and clustering algorithms. Hence, it helps to identify which hybrid model can achieve the best performance for credit scoring purposes. Using four real-life credit datasets, the experimental results show that the model of (KNN-NN-SVMPSO)-(DL)-(DBSCAN) delivers the highest prediction accuracy and the lowest error rates.
