SUBiNN: a stacked uni- and bivariate kNN sparse ensemble

Author(s):  
Tiffany Elsten ◽  
Mark de Rooij

Nearest Neighbor classification is an intuitive distance-based classification method. It has, however, two drawbacks: (1) it is sensitive to the number of features, and (2) it gives no information about the importance of single features or pairs of features. In stacking, a set of base learners is combined into one overall ensemble classifier by means of a meta-learner. In this manuscript we combine univariate and bivariate nearest neighbor classifiers that are by themselves easily interpretable. Furthermore, we combine these classifiers with a Lasso method, which results in a sparse ensemble of nonlinear main and pairwise interaction effects. We christened the new method SUBiNN: Stacked Uni- and Bivariate Nearest Neighbors. SUBiNN overcomes the two drawbacks of simple nearest neighbor methods. In extensive simulations and on benchmark data sets, we evaluate the predictive performance of SUBiNN and compare it to other nearest neighbor ensemble methods as well as Random Forests and Support Vector Machines. Results indicate that SUBiNN often outperforms other nearest neighbor methods, that SUBiNN is well capable of identifying noise features, and that Random Forests is often, but not always, the best classifier.
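As a concrete illustration of the stacking scheme described above, the following minimal Python sketch fits one kNN per single feature and per feature pair and lets a Lasso meta-learner select a sparse subset. It assumes scikit-learn and binary labels; the choices of k, fold count, and a nonnegative Lasso are illustrative, not the authors' exact setup.

# Minimal sketch of a SUBiNN-style stack, assuming scikit-learn and a
# binary task; k=5, 10 folds, and the nonnegative Lasso are assumptions.
from itertools import combinations
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Lasso
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Base learners: one kNN per single feature and per feature pair.
subsets = [(j,) for j in range(X.shape[1])] + \
          list(combinations(range(X.shape[1]), 2))

# Out-of-fold class-1 probabilities form the meta-level features.
Z = np.column_stack([
    cross_val_predict(KNeighborsClassifier(n_neighbors=5),
                      X[:, list(s)], y, cv=10,
                      method="predict_proba")[:, 1]
    for s in subsets
])

# Sparse meta-learner: the Lasso zeroes out noise features and pairs.
meta = Lasso(alpha=0.01, positive=True).fit(Z, y)
selected = [s for s, w in zip(subsets, meta.coef_) if w != 0]
print("retained feature subsets:", selected)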

Author(s):  
Cagatay Catal ◽  
Serkan Tugul ◽  
Basar Akpinar

Software repositories consist of thousands of applications, and manually categorizing these applications into domain categories is expensive and time-consuming. In this study, we investigate an ensemble-of-classifiers approach to automatic software categorization when the source code is not available. To this end, we used three data sets (package level/class level/method level) belonging to 745 closed-source Java applications from the Sharejar repository. We applied the Vote algorithm, AdaBoost, and Bagging ensemble methods, with Support Vector Machines, Naive Bayes, J48, IBk, and Random Forests as base classifiers. The best performance was achieved with the Vote algorithm, whose base classifiers were AdaBoost with J48, AdaBoost with Random Forest, and Random Forest. We showed that the Vote approach with method-level attributes provides the best performance for automatic software categorization; these results demonstrate that the proposed approach can effectively categorize applications into domain categories in the absence of source code.
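The winning configuration can be sketched in scikit-learn terms as a soft-voting ensemble. Here DecisionTreeClassifier stands in for Weka's J48 and VotingClassifier for Weka's Vote meta-scheme, so this is an analogy to, not a reproduction of, the authors' Weka pipeline.

# Sketch of the best-performing Vote configuration, assuming scikit-learn;
# DecisionTreeClassifier is a stand-in for Weka's J48.
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier

vote = VotingClassifier(
    estimators=[
        # ("estimator" is "base_estimator" in scikit-learn < 1.2)
        ("ada_j48", AdaBoostClassifier(estimator=DecisionTreeClassifier())),
        ("ada_rf", AdaBoostClassifier(estimator=RandomForestClassifier())),
        ("rf", RandomForestClassifier()),
    ],
    voting="soft",  # average the predicted class probabilities
)
# Usage: vote.fit(X_train, y_train); vote.predict(X_test)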


Author(s):  
H. Benjamin Fredrick David ◽  
A. Suruliandi ◽  
S. P. Raja

Ensemble methods construct a sequence of classifiers for classifying fresh instances by taking a weighted vote of their individual predictions. Reducing error and increasing accuracy is a long-standing problem in ensemble classification. This paper presents a novel generic object-oriented voting- and weighting-adapted stacking framework for utilizing an ensemble of classifiers for prediction. This universal framework operates on the weighted average of the probabilities of any suite of base learners, and the final prediction is the aggregate of their respective votes. For illustrative purposes, three familiar heterogeneous classifiers, the Support Vector Machine, k-Nearest Neighbor, and Naïve Bayes, are used as candidates for ensemble classification within the proposed stacked framework. Further, the ensemble classifier built upon the framework is compared with others and evaluated using various cross-validation levels and percentage splits on a range of benchmark datasets. The outcome distinguishes the framework from the competition. The proposed framework is also used to predict the crime propensity of prisoners, reaching 99.9901% accuracy.
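The core mechanism, a weighted average of base-learner class probabilities followed by an aggregate vote, can be sketched as follows. The three learners match the abstract, but the data set and the per-learner weights are illustrative assumptions, not values learned by the paper's framework.

# Sketch of weighted probability averaging over SVM, kNN, and Naive
# Bayes, assuming scikit-learn; weights and data set are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

learners = [SVC(probability=True), KNeighborsClassifier(), GaussianNB()]
weights = np.array([0.5, 0.3, 0.2])          # assumed per-learner weights

probas = np.stack([m.fit(Xtr, ytr).predict_proba(Xte) for m in learners])
avg = np.tensordot(weights, probas, axes=1)  # weighted average of probabilities
pred = avg.argmax(axis=1)                    # final vote: highest average probability
print("test accuracy:", (pred == yte).mean())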


Author(s):  
JIE JI ◽  
QIANGFU ZHAO

This paper proposes a hybrid learning method to speed up the classification procedure of Support Vector Machines (SVM). In contrast to most algorithms, which try to decrease the number of support vectors in an SVM classifier, we focus on reducing the number of data points that need an SVM for classification, and on reducing the number of support vectors used in each SVM classification. The system uses a Nearest Neighbor Classifier (NNC) to triage data points. In the training phase, the NNC selects data near the decision boundary and then trains a sub-SVM for each Voronoi pair. At classification time, most non-boundary data points are classified by the NNC directly, while the remaining boundary data points are passed to the corresponding local expert SVM. We also propose a data selection method for training reliable expert SVMs. Experimental results on several generated and public machine learning data sets show that the proposed method significantly accelerates testing.
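A simplified sketch of the gating idea follows: test points whose nearest training neighbors all agree are labeled by the NNC alone, and only the remaining boundary points are passed to an SVM. This illustrates the principle only; it omits the paper's Voronoi-pair construction of local experts and uses a single global SVM as the expert.

# Sketch of NNC-gated SVM classification, assuming scikit-learn;
# a single SVM stands in for the paper's per-Voronoi-pair experts.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC
from sklearn.datasets import make_moons

Xtr, ytr = make_moons(n_samples=400, noise=0.25, random_state=0)
Xte, yte = make_moons(n_samples=200, noise=0.25, random_state=1)

nn = NearestNeighbors(n_neighbors=7).fit(Xtr)
_, idx = nn.kneighbors(Xte)
votes = ytr[idx]

pred = np.empty(len(Xte), dtype=int)
unanimous = votes.min(axis=1) == votes.max(axis=1)   # all neighbors agree
pred[unanimous] = votes[unanimous, 0]                # fast NNC path

svm = SVC().fit(Xtr, ytr)                            # expert for boundary points
pred[~unanimous] = svm.predict(Xte[~unanimous])
print("boundary fraction:", (~unanimous).mean(),
      "accuracy:", (pred == yte).mean())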


2004 ◽  
Vol 3 (1) ◽  
pp. 1-18 ◽  
Author(s):  
Mark R Segal ◽  
Jason D Barbour ◽  
Robert M Grant

The problem of relating genotype (as represented by amino acid sequence) to phenotypes is distinguished from standard regression problems by the nature of sequence data. Here we investigate an instance of such a problem where the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute genotype. A variety of data analytic methods have been proposed in this context. Shortcomings of select techniques are contrasted with the advantages afforded by tree-structured methods. However, tree-structured methods, in turn, have been criticized for enjoying only modest predictive performance. A number of ensemble approaches (bagging, boosting, random forests) have recently emerged, devised to overcome this deficiency. We evaluate random forests as applied in this setting, and detail why prediction gains obtained in other situations are not realized. Other approaches, including logic regression, support vector machines, and neural networks, are also applied. We interpret results in terms of HIV-1 reverse transcriptase structure and function.
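For readers unfamiliar with this kind of analysis, a minimal sketch of regressing a phenotype on sequence data with a random forest is shown below. The one-hot (dummy) encoding of amino-acid positions is a standard choice, and the sequences and response values are synthetic stand-ins, not the HIV-1 data.

# Sketch of sequence-to-phenotype regression with a random forest,
# assuming scikit-learn; all data below are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
amino = np.array(list("ACDEFGHIKLMNPQRSTVWY"))
seqs = rng.choice(amino, size=(200, 30))   # 200 sequences, 30 positions
y = rng.normal(size=200)                   # stand-in replication capacity

# One column per (position, residue) pair.
# ("sparse_output" is "sparse" in scikit-learn < 1.2)
X = OneHotEncoder(sparse_output=False).fit_transform(seqs)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Feature importances indicate which residues drive the phenotype.
print(rf.feature_importances_[:10])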


2020 ◽  
Vol 8 (4) ◽  
pp. 297-303
Author(s):  
Tamunopriye Ene Dagogo-George ◽  
Hammed Adeleye Mojeed ◽  
Abdulateef Oluwagbemiga Balogun ◽  
Modinat Abolore Mabayoje ◽  
Shakirat Aderonke Salihu

Diabetic Retinopathy (DR) is a condition that emerges from prolonged diabetes and causes severe damage to the eyes. Early diagnosis of this disease is imperative, as late diagnosis may be fatal. Existing studies employed machine learning approaches, with Support Vector Machines (SVM) achieving the highest performance in most analyses and Decision Trees (DT) the lowest. However, SVM is known to suffer from parameter and kernel selection problems, which undermine its predictive capability. Hence, this study presents homogeneous ensemble classification methods with DT as the base classifier to optimize predictive performance. Boosting and Bagging ensemble methods with feature selection were employed, and experiments were carried out using the Python scikit-learn libraries on DR data sets extracted from the UCI Machine Learning Repository. Experimental results showed that Bagged and Boosted DT were better than SVM. Specifically, Bagged DT performed best with accuracy 65.38%, f-score 0.664, and AUC 0.731, followed by Boosted DT with accuracy 65.42%, f-score 0.655, and AUC 0.724, compared to SVM with accuracy 65.16%, f-score 0.652, and AUC 0.721. These results indicate that DT's predictive performance can be optimized by employing homogeneous ensemble methods to outperform SVM in predicting DR.
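A minimal scikit-learn sketch of the bagged and boosted DT setup follows. The local file name "dr.csv" and the use of 10-fold cross-validated AUC are assumptions for illustration, not the study's exact protocol.

# Sketch of bagged/boosted decision trees on a DR-style data set,
# assuming scikit-learn; "dr.csv" is a hypothetical local export of
# the UCI data with the class label in the final column.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dr.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

for name, clf in [
    # ("estimator" is "base_estimator" in scikit-learn < 1.2)
    ("Bagged DT", BaggingClassifier(estimator=DecisionTreeClassifier())),
    ("Boosted DT", AdaBoostClassifier(estimator=DecisionTreeClassifier())),
]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(name, "AUC:", scores.mean().round(3))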


2012 ◽  
Vol 24 (4) ◽  
pp. 1047-1084 ◽  
Author(s):  
Xiao-Tong Yuan ◽  
Shuicheng Yan

We investigate Newton-type optimization methods for solving piecewise linear systems (PLSs) with a nondegenerate coefficient matrix. Such systems arise, for example, from the numerical solution of the linear complementarity problem, which is useful for modeling several learning and optimization problems. In this letter, we propose an effective damped Newton method, PLS-DN, to find the exact (up to machine precision) solution of nondegenerate PLSs. PLS-DN exhibits a provable semi-iterative property; that is, the algorithm converges globally to the exact solution in a finite number of iterations. The rate of convergence is shown to be at least linear before termination. We emphasize the applications of our method in modeling, from the novel perspective of PLSs, statistical learning problems such as box-constrained least squares, elitist Lasso (Kowalski & Torrésani, 2008), and support vector machines (Cortes & Vapnik, 1995). Numerical results on synthetic and benchmark data sets demonstrate the effectiveness and efficiency of PLS-DN on these problems.
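To make the setting concrete, the sketch below runs a damped semismooth Newton iteration on the piecewise linear system F(x) = min(x, Mx + q) = 0, which encodes a linear complementarity problem. It illustrates the class of solvers discussed, not the exact PLS-DN algorithm; the problem instance and damping rule are assumptions.

# Damped Newton sketch for the PLS F(x) = min(x, Mx + q) = 0 (an LCP),
# using numpy only; positive definite M makes the system nondegenerate.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(20, 20))
M = B @ B.T + np.eye(20)             # positive definite coefficient matrix
q = rng.normal(size=20)

x = np.zeros(20)
for it in range(50):
    F = np.minimum(x, M @ x + q)     # PLS residual
    if np.linalg.norm(F, np.inf) < 1e-12:
        break
    # Generalized Jacobian: identity rows where x attains the min,
    # rows of M where Mx + q attains it.
    active = x <= M @ x + q
    J = np.where(active[:, None], np.eye(20), M)
    d = np.linalg.solve(J, -F)
    t = 1.0                          # simple backtracking damping
    while np.linalg.norm(np.minimum(x + t * d, M @ (x + t * d) + q)) > \
            (1 - 1e-4 * t) * np.linalg.norm(F) and t > 1e-10:
        t /= 2
    x = x + t * d
print("iterations:", it,
      "residual:", np.linalg.norm(np.minimum(x, M @ x + q), np.inf))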


2021 ◽  
Author(s):  
Isabella Södergren ◽  
Maryam Pahlavan Nodeh ◽  
Prakash Chandra Chhipa ◽  
Konstantina Nikolaidou ◽  
György Kovács

2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points, so the memory complexity of the analysis is no less than O(N²). In this article we present an incremental manifold learning approach for handling large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained from the training data set. A local curvature variation algorithm is used to sample a subset of data points as landmarks, and a manifold skeleton is then identified from the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
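A rough sketch of the landmark idea follows: score each point by how much its local neighborhood deviates from a flat plane, keep the highest-variation points as landmarks, and embed only those. The PCA-residual score stands in for the paper's local curvature variation algorithm, and a Swiss-roll data set stands in for hyperspectral pixels.

# Landmark-based manifold skeleton sketch, assuming scikit-learn;
# the curvature score and data set are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.neighbors import NearestNeighbors

X, _ = make_swiss_roll(n_samples=2000, random_state=0)

_, idx = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)
def residual(nbhd):                    # variance unexplained by a local 2D plane
    return 1.0 - PCA(n_components=2).fit(nbhd).explained_variance_ratio_.sum()
scores = np.array([residual(X[i]) for i in idx])

landmarks = np.argsort(scores)[-200:]  # keep the 200 highest-variation points
skeleton = Isomap(n_neighbors=10, n_components=2).fit(X[landmarks])
print("skeleton embedding shape:", skeleton.embedding_.shape)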

