A cross-entropy based stacking method in ensemble learning

2020 · Vol 39 (3) · pp. 4677-4688
Author(s): Weimin Ding, Shengli Wu

Stacking is one of the major types of ensemble learning, in which a set of base classifiers contribute their outputs to a meta-level classifier that combines them to produce more accurate classifications. In this paper, we propose a new stacking algorithm that uses cross-entropy as the loss function for the classification problem. The training process is conducted with a neural network trained by stochastic gradient descent. One major characteristic of our method is that it treats each meta-instance as a whole within a single optimization model, which differs from stacking methods such as stacking with multi-response linear regression and stacking with multi-response model trees. Those methods divide each meta-instance into a set of sub-instances and fit a separate model to the sub-instances of each class label, with no connection between the different models. Our joint treatment is likely a better choice for finding suitable weights. Experiments with 22 data sets from the UCI machine learning repository show that the proposed stacking approach performs well: on average it outperforms all three base classifiers, several state-of-the-art stacking algorithms, and some other representative ensemble learning methods.
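As a rough illustration of this idea, the sketch below trains a single softmax meta-classifier with cross-entropy loss via mini-batch SGD over whole meta-instances (e.g. the concatenated class-probability outputs of the base classifiers). It is a minimal stand-in, not the authors' exact network; all names and hyperparameters are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_meta(P, y, n_classes, lr=0.1, epochs=100, batch=32, seed=0):
    """Softmax meta-classifier with cross-entropy loss, trained by SGD.

    P : (n, d) meta-instances, e.g. concatenated class-probability
        outputs of the base classifiers; y : (n,) integer class labels.
    Each meta-instance is handled as a whole by one model.
    """
    rng = np.random.default_rng(seed)
    n, d = P.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                    # one-hot targets
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            probs = softmax(P[idx] @ W + b)
            g = (probs - Y[idx]) / len(idx)     # gradient of cross-entropy
            W -= lr * P[idx].T @ g
            b -= lr * g.sum(axis=0)
    return W, b
```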

2021 · Vol 11 (6) · pp. 2511
Author(s): Julian Hatwell, Mohamed Medhat Gaber, R. Muhammad Atif Azad

This research presents Gradient Boosted Tree High Importance Path Snippets (gbt-HIPS), a novel, heuristic method for explaining gradient boosted tree (GBT) classification models by extracting a single classification rule (CR) from the ensemble of decision trees that make up the GBT model. This CR contains the most statistically important boundary values of the input space as antecedent terms. The CR represents a hyper-rectangle of the input space inside which the GBT model very reliably classifies all instances with the same class label as the explanandum instance. In a benchmark test using nine data sets and five competing state-of-the-art methods, gbt-HIPS offered the best trade-off between coverage (0.16-0.75) and precision (0.85-0.98). Unlike competing methods, gbt-HIPS is also demonstrably guarded against under- and over-fitting. A further distinguishing feature of our method is that, unlike much prior work, our explanations also provide counterfactual detail, in accordance with widely accepted recommendations for what makes a good explanation.
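For intuition, a CR over a hyper-rectangle can be represented as a list of antecedent terms and checked against an instance as in the sketch below; the representation and names are illustrative assumptions, not gbt-HIPS's actual data structures.

```python
# A classification rule (CR) as a hyper-rectangle: each antecedent term is
# (feature_index, lower_bound, upper_bound), with None for an open side.
def rule_covers(rule, x):
    """True if instance x falls inside the rule's hyper-rectangle."""
    return all(
        (lo is None or x[i] > lo) and (hi is None or x[i] <= hi)
        for i, lo, hi in rule
    )

# Illustrative rule: x[2] > 0.5 AND x[7] <= 3.1
rule = [(2, 0.5, None), (7, None, 3.1)]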


2021 · pp. 1-18
Author(s): Angeliki Koutsimpela, Konstantinos D. Koutroumbas

Several well-known clustering algorithms have online counterparts, in order to deal effectively with the big-data setting as well as with data that become available in a streaming fashion. However, very few of them follow the stochastic gradient descent philosophy, despite the fact that the latter enjoys certain practical advantages (such as the possibility of (a) running faster than batch-processing counterparts and (b) escaping from local minima of the associated cost function), while, in addition, strong theoretical convergence results have been established for it. In this paper a novel stochastic gradient descent possibilistic clustering algorithm, called O-PCM2, is introduced. The algorithm is presented in detail, and it is rigorously proved that the gradient of the associated cost function tends to zero in the L2 sense, based on general convergence results established for the family of stochastic gradient descent algorithms. Furthermore, an additional discussion is provided on the nature of the points to which the algorithm may converge. Finally, the performance of the proposed algorithm is tested against other related algorithms on both synthetic and real data sets.
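A schematic single-sample update in this spirit is sketched below: for each streaming point, possibilistic memberships of the cluster representatives are computed and the representatives take a gradient step towards the point. This is a generic illustration under assumed exponential memberships, not the exact O-PCM2 recursion.

```python
import numpy as np

def sgd_possibilistic_step(x, theta, eta, lr):
    """One stochastic-gradient update of possibilistic cluster centres.

    x     : (d,) the current streaming sample
    theta : (k, d) cluster representatives, updated in place
    eta   : (k,) per-cluster scale parameters
    lr    : step size (in theory a slowly vanishing sequence)
    """
    d2 = ((theta - x) ** 2).sum(axis=1)       # squared distances to x
    u = np.exp(-d2 / eta)                     # possibilistic memberships
    # gradient step on the weighted distortion (constant factor in lr)
    theta += lr * u[:, None] * (x - theta)
    return theta
```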


2019 · Vol 10 (1) · pp. 64
Author(s): Yi Lin, Honggang Zhang

In the era of big data, multi-instance learning, as a weakly supervised learning framework, has various applications, since it helps reduce the cost of the data-labeling process. Due to this weakly supervised setting, learning effective instance representations/embeddings is challenging. To address this issue, we propose an instance-embedding regularizer that can boost the performance of both instance- and bag-embedding learning in a unified fashion. Specifically, the crux of the instance-embedding regularizer is to maximize the correlation between instance embeddings and the underlying instance-label similarities. The embedding-learning framework was implemented as a neural network and optimized end-to-end using stochastic gradient descent. In experiments, various applications were studied, and the results show that the proposed instance-embedding-regularization method is highly effective, achieving state-of-the-art performance.
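As an illustrative stand-in for such a regularizer (assuming a pairwise label-similarity matrix is available, which the weakly supervised setting would in practice have to approximate), one could penalize the negative Pearson correlation between pairwise embedding similarities and label similarities:

```python
import numpy as np

def embedding_corr_regularizer(E, S):
    """Negative Pearson correlation between pairwise embedding similarities
    and pairwise label similarities; minimizing it maximizes the correlation.

    E : (n, d) instance embeddings;  S : (n, n) label-similarity matrix.
    An illustrative stand-in, not the paper's exact formulation.
    """
    En = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    A = En @ En.T                     # cosine similarities of embeddings
    a, s = A.ravel(), S.ravel()
    a = a - a.mean()
    s = s - s.mean()
    corr = (a @ s) / (np.linalg.norm(a) * np.linalg.norm(s) + 1e-12)
    return -corr
```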


2020 · Vol 32 (4) · pp. 759-793
Author(s): Hoai An Le Thi, Vinh Thanh Ho

We investigate an approach based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) for online learning. The prediction problem of an online learner can be formulated as a DC program to which online DCA is applied. We propose two versions of the online DCA scheme, a complete and an approximate one, and prove that they enjoy logarithmic and sublinear regret bounds, respectively. Six online DCA-based algorithms are developed for online binary linear classification. Numerical experiments on a variety of benchmark classification data sets show the efficiency of our proposed algorithms in comparison with state-of-the-art online classification algorithms.
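To make the flavour of online DCA concrete, the sketch below applies a DCA-style update to the ramp loss r(z) = max(0, 1 - z) - max(0, -z) with z = y<w, x>, a standard DC decomposition used for binary linear classification. It is a generic illustration, not one of the paper's six algorithms.

```python
import numpy as np

def online_dca_ramp_step(w, x, y, lr):
    """One online DCA-flavoured update for the ramp loss g - h, with
    g(z) = max(0, 1 - z) and h(z) = max(0, -z), z = y * <w, x>.
    The concave part -h is linearized at the current w via a subgradient,
    then a subgradient step is taken on the resulting convex upper model.
    """
    z = y * (w @ x)
    dh = -y * x if z < 0 else np.zeros_like(x)  # subgradient of h w.r.t. w
    dg = -y * x if z < 1 else np.zeros_like(x)  # subgradient of g w.r.t. w
    return w - lr * (dg - dh)                   # step on g(w) - <dh, w>
```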


Author(s): Xuan Wu, Qing-Guo Chen, Yao Hu, Dengbao Wang, Xiaodong Chang, ...

Multi-view multi-label learning serves as an important framework for learning from objects with diverse representations and rich semantics. Existing multi-view multi-label learning techniques focus on exploiting a shared subspace for fusing multi-view representations, while view-specific information that is helpful for discriminative modeling is usually ignored. In this paper, a novel multi-view multi-label learning approach named SIMM is proposed, which combines shared-subspace exploitation with view-specific information extraction. For shared-subspace exploitation, SIMM jointly minimizes a confusion adversarial loss and a multi-label loss to utilize information shared across all views. For view-specific information extraction, SIMM enforces an orthogonality constraint w.r.t. the shared subspace to utilize view-specific discriminative information. Extensive experiments on real-world data sets clearly show the favorable performance of SIMM against other state-of-the-art multi-view multi-label learning approaches.
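As a generic surrogate for such a constraint (not SIMM's exact loss), one can penalize the Frobenius norm of the cross-product between shared and view-specific feature matrices, driving the two representations towards orthogonality:

```python
import numpy as np

def orthogonal_penalty(S, V):
    """Soft orthogonality penalty ||S^T V||_F^2 between shared-subspace
    features S (n, ds) and view-specific features V (n, dv).
    A generic surrogate added to the training loss; illustrative only.
    """
    return np.sum((S.T @ V) ** 2)
```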


Author(s): Simone Göttlich, Claudia Totzeck

We propose a neural network approach to model general interaction dynamics and an adjoint-based stochastic gradient descent algorithm to calibrate its parameters. The parameter calibration problem is considered as an optimal control problem and investigated from a theoretical and numerical point of view. We prove the existence of optimal controls, derive the corresponding first-order optimality system, and formulate a stochastic gradient descent algorithm to identify parameters for given data sets. To validate the approach, we fit the parameters to real data sets from traffic and crowd dynamics. The results are compared to forces corresponding to well-known interaction models such as the Lighthill-Whitham-Richards model for traffic and the social force model for crowd motion.
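Schematically, the calibration loop looks like the sketch below. The paper obtains gradients via the adjoint of the forward dynamics; here that is replaced by a cheap finite-difference stand-in, and `simulate` and `loss` are assumed user-supplied functions (all names are illustrative).

```python
import numpy as np

def calibrate(theta, batches, simulate, loss, lr=1e-2, eps=1e-5):
    """Schematic stochastic gradient descent for parameter calibration.

    theta    : (p,) parameter vector of the interaction model
    batches  : iterable of data subsets (one SGD step per batch)
    simulate : simulate(theta, batch) integrates the dynamics forward
    loss     : loss(trajectory, batch) compares with observed data
    """
    for batch in batches:
        base = loss(simulate(theta, batch), batch)
        grad = np.zeros_like(theta)
        for i in range(theta.size):          # forward differences as a
            t = theta.copy()                 # stand-in for the adjoint
            t[i] += eps
            grad[i] = (loss(simulate(t, batch), batch) - base) / eps
        theta = theta - lr * grad            # stochastic gradient step
    return theta
```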


2014 · Vol 687-691 · pp. 1342-1345
Author(s): Jie Ding, Li Peng Zhu, Bin Hu, Ren Long Hang, Yu Bao Sun

With the rapid advance of data collection and storage techniques, it is easy to acquire data sets with tens of millions or even billions of instances. How to explore and exploit the useful or interesting information in such data sets has become an urgent issue. The traditional k-means clustering algorithm has been widely used in the data mining community. First, k clustering centres are randomly initialized. Then, all instances are assigned to k different classes according to their distances to the clustering centres. Lastly, each clustering centre is updated to the mean of its constituent instances. This process is iterated until convergence. Obviously, at each iteration the distance matrix from all instances to the k clustering centres must be calculated, which costs a great deal of time on large-scale data sets. To address this issue, we propose a fast optimization algorithm based on stochastic gradient descent (SGD): at each iteration, a randomly chosen instance is assigned to its nearest clustering centre, and that centre is updated immediately. Experimental results show that our proposed method achieves competitive clustering results at a lower time cost.
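A minimal sketch of this online scheme, in the classic MacQueen/Bottou style with a per-centre decaying step size (all names and constants are illustrative):

```python
import numpy as np

def sgd_kmeans(X, k, lr0=0.5, n_steps=100_000, seed=0):
    """Online (stochastic gradient) k-means: pick one random instance,
    find its nearest centre, and move that centre towards the instance."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)                       # per-centre update counts
    for _ in range(n_steps):
        x = X[rng.integers(len(X))]            # random instance
        j = np.argmin(((centres - x) ** 2).sum(axis=1))  # nearest centre
        counts[j] += 1
        centres[j] += (lr0 / counts[j]) * (x - centres[j])  # SGD update
    return centres
```

Compared with batch k-means, each step touches a single instance instead of recomputing the full instance-to-centre distance matrix, which is what makes the method attractive for large-scale data.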


Author(s): Manisha Padala, Sujit Gujar

In classification models, fairness can be ensured by solving a constrained optimization problem. We focus on fairness constraints such as Disparate Impact, Demographic Parity, and Equalized Odds, which are non-decomposable and non-convex. Researchers typically define convex surrogates of the constraints and then apply convex optimization frameworks to obtain fair classifiers. The surrogates serve as upper bounds to the actual constraints, and convexifying fairness constraints is challenging. We propose a neural network-based framework, FNNC, to achieve fairness while maintaining high classification accuracy. The above fairness constraints are included in the loss using Lagrangian multipliers. We prove bounds on the generalization errors for the constrained losses, which asymptotically go to zero. The network is optimized using two-step mini-batch stochastic gradient descent. Our experiments show that FNNC performs as well as the state of the art, if not better. The experimental evidence supplements our theoretical guarantees. In summary, we offer an automated solution for achieving fairness in classification that is easily extendable to many fairness constraints.
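A minimal sketch of one such two-step update, using a logistic model with a demographic-parity surrogate for simplicity (the paper uses a neural network and covers further constraints); all names and the eps tolerance are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lagrangian_fairness_step(w, lam, X, y, a, lr_w=0.1, lr_lam=0.01, eps=0.05):
    """One two-step mini-batch update for a Lagrangian fairness loss.

    X : (m, d) features; y : (m,) binary labels; a : (m,) binary protected
    attribute. Constraint: |E[p | a=1] - E[p | a=0]| <= eps.
    Descend in the model weights w, ascend in the multiplier lam.
    """
    p = sigmoid(X @ w)
    # primal step: cross-entropy gradient + lam * surrogate gradient
    g_ce = X.T @ (p - y) / len(y)
    gap = p[a == 1].mean() - p[a == 0].mean()
    dp = p * (1 - p)                             # d sigmoid / d z
    g_gap = (X[a == 1] * dp[a == 1, None]).mean(axis=0) \
          - (X[a == 0] * dp[a == 0, None]).mean(axis=0)
    w = w - lr_w * (g_ce + lam * np.sign(gap) * g_gap)
    # dual step: increase lam while the constraint is violated
    lam = max(0.0, lam + lr_lam * (abs(gap) - eps))
    return w, lam
```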


2020 · Vol 21 (2) · pp. 323-336
Author(s): Zhen Lu, Meng Lu, Yan Liang

The application of deep learning in industry often requires training large-scale neural networks on large-scale data sets. However, larger networks and larger data sets lead to longer training times, which hinders algorithm research and practical engineering progress. Data-parallel distributed training is a commonly used solution, but it is still at the stage of technical exploration. In this paper, we study how to improve the accuracy and speed of distributed training, and propose a distributed training strategy based on hybrid gradient computing. Specifically, in the gradient descent stage, we propose a hybrid method that combines a new warmup scheme with the linear-scaling stochastic gradient descent (SGD) algorithm to effectively improve the training accuracy and convergence rate. At the same time, we adopt mixed-precision gradient computing: in single-GPU gradient computation and inter-GPU gradient synchronization, we use a mix of single precision (FP32) and half precision (FP16), which improves both single-GPU training speed and inter-GPU communication speed. Through the integration of these training strategies and careful system engineering, we finish ResNet-50 training in 20 minutes on a cluster of 24 V100 GPUs, with 75.6% Top-1 accuracy and 97.5% GPU scaling efficiency. In addition, we propose a new criterion for evaluating distributed training efficiency, namely the actual average single-GPU training time, which assesses improvements to the training method more reasonably than speedups that merely reflect an increased number of GPUs. In terms of this criterion, our method outperforms existing methods.
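A generic learning-rate schedule in the spirit of the linear-scaling rule with gradual warmup is sketched below; the cosine decay and all constants are illustrative assumptions, not the paper's exact schedule.

```python
import math

def lr_schedule(step, steps_per_epoch, base_lr=0.1, batch_size=2048,
                ref_batch=256, warmup_epochs=5, total_epochs=90):
    """Gradual warmup to a linearly scaled target LR, then cosine decay."""
    target = base_lr * batch_size / ref_batch      # linear scaling rule
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:                        # linear ramp from base_lr
        return base_lr + (target - base_lr) * step / warmup_steps
    total_steps = total_epochs * steps_per_epoch
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * target * (1.0 + math.cos(math.pi * t))
```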


2018
Author(s): Mirko Torrisi, Manaz Kaleel, Gianluca Pollastri

Motivation: Although secondary structure predictors have been developed for more than 60 years, current ab initio methods still have some way to go to reach their theoretical limits. Moreover, the continuous effort towards harnessing ever-increasing data sets and more sophisticated, deeper machine learning techniques has not come to an end.
Results: Here we present Porter 5, the latest release of one of the best-performing ab initio secondary structure predictors. Version 5 achieves 84% accuracy (84% SOV) when tested on 3 classes, and 73% accuracy (82% SOV) on 8 classes, on a large independent set, significantly outperforming all the most recent ab initio predictors we have tested.
Availability: The web and standalone versions of Porter 5 are available at http://distilldeep.ucd.ie/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

