HYBRID DECISION TREE ARCHITECTURE UTILIZING LOCAL SVMs FOR EFFICIENT MULTI-LABEL LEARNING

Author(s):  
DEJAN GJORGJEVIKJ ◽  
GJORGJI MADJAROV ◽  
SAŠO DŽEROSKI

Multi-label learning (MLL) problems abound in many areas, including text categorization, protein function classification, and semantic annotation of multimedia. A key issue that severely limits the applicability of many current machine learning approaches to MLL is the large scale of the problems, which has a strong impact on the computational complexity of learning. This issue is especially pronounced for approaches that transform MLL problems into a set of binary classification problems for which Support Vector Machines (SVMs) are used. On the other hand, the most efficient approaches to MLL, based on decision trees, have clearly lower predictive performance. We propose a hybrid decision tree architecture, where the leaves do not give multi-label predictions directly, but rather utilize local SVM-based classifiers giving multi-label predictions. A binary relevance architecture is employed in the leaves, where a binary SVM classifier is built for each of the labels relevant to that particular leaf. We use a broad range of multi-label datasets with a variety of evaluation measures to evaluate the proposed method against related and state-of-the-art methods, both in terms of predictive performance and time complexity. On almost every large classification problem, our hybrid architecture outperforms the competing approaches in terms of predictive performance, while its computational efficiency is significantly improved as a result of the integrated decision tree.
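
The leaf-local binary relevance idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a label counts as "relevant" to a leaf when at least one training example routed to that leaf carries it, and the names `relevant_labels_per_leaf`, `leaf_ids`, and `label_sets` are hypothetical. Tree construction and SVM training are omitted; the sketch only shows which binary classifiers each leaf would need.

```python
from collections import defaultdict

def relevant_labels_per_leaf(leaf_ids, label_sets):
    """Collect, for every leaf, the labels seen among its training examples.

    leaf_ids   -- leaf identifier of each training example (assumed to come
                  from routing the example through an already-built tree)
    label_sets -- set of labels attached to each training example
    """
    relevant = defaultdict(set)
    for leaf, labels in zip(leaf_ids, label_sets):
        relevant[leaf] |= set(labels)
    return dict(relevant)

# Hypothetical toy data: five examples routed to two leaves.
leaf_ids = [0, 0, 1, 1, 1]
label_sets = [{"sports"}, {"sports", "news"}, {"music"}, {"music"}, {"film", "music"}]

per_leaf = relevant_labels_per_leaf(leaf_ids, label_sets)
for leaf, labels in sorted(per_leaf.items()):
    # One binary classifier per relevant label would be trained in this leaf,
    # using only the examples that reached it.
    print(leaf, sorted(labels))
```

Under this assumption each leaf trains classifiers only for the labels that occur in it, and only on the examples that reach it, which is where the efficiency gain over a global binary relevance setup would come from.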

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Wenhao Xie ◽  
Yanhong She ◽  
Qiao Guo

Support vector machines (SVMs) were originally designed to solve binary classification problems, but many real-world problems are multiclass. Multiclassification methods based on SVMs are mainly divided into direct methods and indirect methods. The indirect methods, which integrate multiple binary classifiers according to certain rules to form the multiclassification model, are currently the most commonly used. In this paper, an improved multiclassification algorithm based on the balanced binary decision tree is proposed, called the IBDT-SVM algorithm. The algorithm considers not only the influence of “between-classes distance” and “class variance” used in traditional measures of between-classes separability but also takes “between-classes variance” into consideration, yielding a new, improved “between-classes separability measure.” Based on this measure, it finds the two classes with the largest between-classes separability and uses them as the positive and negative samples to train the classifier. After that, according to the principle of class-grouping-by-majority, each remaining class is merged into whichever of these two classes it is closer to, and the SVM classifier is trained again on the resulting positive and negative samples. For samples with uneven or sparse distribution, this method can avoid the error caused by the shortest center distance classification method and overcome, to the greatest extent, the “error accumulation” problem of the traditional binary decision tree, so as to obtain a better classifier. Following this procedure, each layer of nodes in the decision tree is traversed until the output classification result is a single-class label. The experimental results show that the IBDT-SVM algorithm proposed in this paper can achieve better classification accuracy and effectiveness for multiple classification problems.
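
The step of picking the two most separable classes can be sketched as follows. The abstract does not give the improved measure, so this sketch swaps in a simple, commonly used stand-in: the distance between class centers divided by the sum of the class spreads. The function names and the toy data are hypothetical, and the “between-classes variance” term of the actual IBDT-SVM measure is not modeled here.

```python
import math
from itertools import combinations

def center(points):
    """Mean vector of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def spread(points, c):
    """Root-mean-square distance of the points from their center c."""
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def separability(a, b):
    """Stand-in measure: center distance over summed spreads (an assumption).

    Assumes at least one class has nonzero spread.
    """
    ca, cb = center(a), center(b)
    return math.dist(ca, cb) / (spread(a, ca) + spread(b, cb))

def most_separable_pair(classes):
    """Return the pair of class names with the largest separability."""
    return max(
        combinations(sorted(classes), 2),
        key=lambda pair: separability(classes[pair[0]], classes[pair[1]]),
    )

# Hypothetical toy data: three classes in 2-D with identical spreads.
classes = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "B": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)],
    "C": [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0)],
}
pair = most_separable_pair(classes)
print(pair)
```

In the full algorithm the selected pair would seed the positive and negative groups at a tree node, and the remaining classes would then be merged into one group or the other before the node's SVM is retrained.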


2021 ◽  
Vol 40 (1) ◽  
pp. 1481-1494
Author(s):  
Geng Deng ◽  
Yaoguo Xie ◽  
Xindong Wang ◽  
Qiang Fu

Many classification problems contain shape information from input features, such as monotonicity, convexity, and concavity. In this research, we propose a new classifier, called Shape-Restricted Support Vector Machine (SR-SVM), which uses component-wise shape information to enhance classification accuracy. There exists a vast research literature on monotonic classification covering monotonic or ordinal shapes. Our proposed classifier extends this to handle convex and concave types of features, and combinations of these types. While the standard SVM uses linear separating hyperplanes, our novel SR-SVM essentially constructs non-parametric and nonlinear separating planes subject to component-wise shape restrictions. We formulate the SR-SVM classifier as a convex optimization problem and solve it using an active-set algorithm. The approach applies basis function expansions on the input and effectively utilizes the standard SVM solver. We illustrate our methodology using simulation and real-world examples, and show that SR-SVM improves the classification performance with additional shape information of input.


2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become an effective avenue for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performance could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we have integrated a variety of feature encoding schemes to provide sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, a customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed to identify informative features, and the optimal ones were then input into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse was able to achieve more accurate performance than state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.


2019 ◽  
Vol 19 (03) ◽  
pp. 1950008
Author(s):  
MONALISA MOHANTY ◽  
PRADYUT BISWAL ◽  
SUKANTA SABUT

Ventricular tachycardia (VT) and ventricular fibrillation (VF) are life-threatening ventricular arrhythmias that require emergency treatment. Detection of VT and VF at an early stage is crucial for the success of defibrillation treatment. Hence, an automatic system using a computer-aided diagnosis tool is helpful in detecting ventricular arrhythmias in the electrocardiogram (ECG) signal. In this paper, a discrete wavelet transform (DWT) was used to denoise the ECG signals and decompose them into different consecutive frequency bands. The methodology was tested using ECG data from the standard CU ventricular tachyarrhythmia database (CUDB) and the MIT-BIH malignant ventricular ectopy database (VFDB) of PhysioNet. A set of time-frequency features consisting of temporal, spectral, and statistical features was extracted and ranked by correlation attribute evaluation with the ranker search method in order to improve the accuracy of detection. The ranked features were classified for VT and VF conditions using a support vector machine (SVM) and a decision tree (C4.5) classifier. The proposed DWT-based features yielded an average sensitivity of 98%, specificity of 99.32%, and accuracy of 99.23% using the decision tree (C4.5) classifier. These results were better than those of the SVM classifier, which had an average accuracy of 92.43%. The obtained results indicate that using DWT-based time-frequency features with a decision tree (C4.5) classifier can be one of the best choices for clinicians for precise detection of ventricular arrhythmias.
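
For readers who want to relate the reported sensitivity, specificity, and accuracy to raw classifier output, the standard definitions can be sketched as below. The confusion-matrix counts are hypothetical and are not taken from the paper; `binary_metrics` is an illustrative name.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # fraction of true events detected
    specificity = tn / (tn + fp)               # fraction of non-events rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # fraction of all cases correct
    return sensitivity, specificity, accuracy

# Hypothetical counts for 200 ECG segments (50 arrhythmic, 150 normal).
sens, spec, acc = binary_metrics(tp=49, fn=1, tn=146, fp=4)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f}")
```

Because accuracy mixes both classes, a high accuracy on an imbalanced dataset can coexist with a lower sensitivity, which is why studies of this kind typically report all three figures.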


mBio ◽  
2020 ◽  
Vol 11 (3) ◽  
Author(s):  
Begüm D. Topçuoğlu ◽  
Nicholas A. Lesniak ◽  
Mack T. Ruffin ◽  
Jenna Wiens ◽  
Patrick D. Schloss

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
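
The AUROC figures quoted in this abstract can be read as the probability that a randomly chosen case is scored higher than a randomly chosen control. A minimal sketch of that pairwise definition follows; it is not the study's pipeline, and the labels and scores are hypothetical toy values.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case score and a control score count as half a win.
    Assumes labels are 1 (case) or 0 (control) and both classes are present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: three cases and three controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
value = auroc(labels, scores)
print(round(value, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported values of 0.695 and 0.680 in context.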


2018 ◽  
Vol 8 (12) ◽  
pp. 2574 ◽  
Author(s):  
Qinghua Mao ◽  
Hongwei Ma ◽  
Xuhui Zhang ◽  
Guangming Zhang

The Skewness Decision Tree Support Vector Machine (SDTSVM) algorithm is widely known as a supervised learning model for multi-class classification problems. However, the classification accuracy of the SDTSVM algorithm depends on the selection of its parameters and on the classification order. Therefore, an improved SDTSVM (ISDTSVM) algorithm is proposed in order to improve the classification accuracy of steel cord conveyor belt defects. In the proposed model, the classification order is determined by the sum of the Euclidean distances between multi-class sample centers, and the parameters are optimized by the inertia weight Particle Swarm Optimization (PSO) algorithm. In order to verify the effectiveness of the ISDTSVM algorithm with different feature spaces, experiments were conducted on multiple UCI (University of California Irvine) data sets and on steel cord conveyor belt defects using the proposed ISDTSVM algorithm and the conventional SDTSVM algorithm, respectively. The average classification accuracies of five-fold cross-validation were obtained for two kinds of kernel functions. For the Vowel, Zoo, and Wine data sets of the UCI data sets, as well as the steel cord conveyor belt defects, the ISDTSVM algorithm improved the classification accuracy by 3%, 3%, 1%, and 4%, respectively, compared to the SDTSVM algorithm. The classification accuracy of the radial basis function kernel was higher than that of the polynomial kernel. The results indicated that the proposed ISDTSVM algorithm improved the classification accuracy significantly compared to the conventional SDTSVM algorithm.
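
The ordering rule mentioned in this abstract, based on the sum of Euclidean distances between class centers, can be sketched as follows. The abstract does not say whether the class with the largest or the smallest sum is separated first; this sketch assumes the largest sum goes first, and the function name and class centers are hypothetical.

```python
import math

def classification_order(centers):
    """Order classes by the summed distance from each center to all others.

    Assumption: the class farthest from the rest (largest sum) is split off
    first in the skewed decision tree.
    """
    totals = {
        name: sum(math.dist(c, other) for o, other in centers.items() if o != name)
        for name, c in centers.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical class centers in a 2-D feature space.
centers = {"c1": (0.0, 0.0), "c2": (1.0, 0.0), "c3": (10.0, 0.0)}
order = classification_order(centers)
print(order)
```

In a skewed tree each node separates one class from all remaining ones, so separating the most isolated class first is one plausible way to limit error propagation to deeper nodes.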


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Xinke Zhan ◽  
Zhuhong You ◽  
Changqing Yu ◽  
Liping Li ◽  
Jie Pan

Identifying drug-target interactions (DTIs) plays an essential role in new drug development. However, knowledge of DTIs is still limited, and a significant number of DTI pairs remain unknown. Moreover, traditional experimental methods have inevitable disadvantages, such as being costly and time-consuming. Therefore, developing computational methods for predicting DTIs is attracting more and more attention. In this study, we report a novel computational approach for predicting DTIs using the GIST feature, the position-specific scoring matrix (PSSM), and rotation forest (RF). Specifically, each target protein is first converted into a PSSM to retain evolutionary information. Then, the GIST feature is extracted from the PSSM, and substructure fingerprint information is adopted to extract the feature of the drug. Finally, the protein and drug features are combined to form a new drug-target pair, which is employed as the input feature for the RF classifier. In the experiment, the proposed method achieves high average accuracies of 89.25%, 85.93%, 82.36%, and 73.89% on enzyme, ion channel, G protein-coupled receptors (GPCRs), and nuclear receptor, respectively. To further evaluate the prediction performance of the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the same gold standard dataset. These promising results illustrate that the proposed method is more effective and stable than other methods. We expect the proposed method to be a useful tool for predicting large-scale DTIs.


2012 ◽  
Vol 2012 ◽  
pp. 1-7 ◽  
Author(s):  
Hao Jiang ◽  
Wai-Ki Ching

High-dimensional bioinformatics data sets provide an excellent and challenging research problem in the machine learning area. In particular, gene expression data generated by DNA microarrays are of high dimension with a significant level of noise. Supervised kernel learning with an SVM classifier has been successfully applied in biomedical diagnosis, such as discriminating different kinds of tumor tissues. The correlation kernel has recently been applied to classification problems with Support Vector Machines (SVMs). In this paper, we develop a novel and parsimonious positive semidefinite kernel. The proposed kernel is shown experimentally to have better performance than the usual correlation kernel. In addition, we propose a new kernel based on the correlation matrix, incorporating techniques for dealing with indefinite kernels. The resulting kernel is shown to be positive semidefinite, and it exhibits superior performance to the two kernels mentioned above. We then apply the proposed method to cancer data to discriminate different tumor tissues, providing information for the diagnosis of diseases. Numerical experiments indicate that our method outperforms existing methods such as the decision tree method and the KNN method.


2015 ◽  
Vol 713-715 ◽  
pp. 1513-1519 ◽  
Author(s):  
Wei Dong Du ◽  
Bao Wei Chen ◽  
Hai Sen Li ◽  
Chao Xu

In order to solve fish classification problems based on acoustic scattering data, temporal centroid (TC) features and discrete cosine transform (DCT) coefficient features are extracted to analyze the acoustic scattering characteristics of fish from different aspects. The extracted features are reduced in dimension and fused, and a support vector machine (SVM) classifier is used to classify and identify the fish. Three different kinds of fish are selected as research objects in this paper, and the correct identification rates are given for the temporal centroid features, the discrete cosine transform coefficient features, and the fused features. The processing results of actual experimental data show that the multi-feature fusion method can effectively improve the identification rate by about 5%.
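
The two feature types named in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so this sketch assumes the temporal centroid is the energy-weighted mean time of the echo and uses the unnormalized DCT-II for the coefficients. Fusion is shown as plain concatenation, which is also an assumption, and the dimension-reduction step is omitted. All names and the toy echo are hypothetical.

```python
import math

def temporal_centroid(signal, dt=1.0):
    """Energy-weighted mean time of a sampled signal (assumed definition).

    Assumes the signal has nonzero total energy.
    """
    energy = [x * x for x in signal]
    return sum(i * dt * e for i, e in enumerate(energy)) / sum(energy)

def dct2(signal, n_coeffs):
    """First n_coeffs coefficients of the unnormalized DCT-II."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n_coeffs)
    ]

# Hypothetical toy echo envelope, symmetric about its middle sample.
echo = [0.0, 1.0, 2.0, 1.0, 0.0]
tc = temporal_centroid(echo)
coeffs = dct2(echo, 3)

# Fused feature vector (concatenation is an assumption, not the paper's method).
features = [tc] + coeffs
print(features)
```

A vector like `features` would then be passed to the SVM classifier; in practice the two feature groups would usually be scaled to comparable ranges before fusion.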

