A Holistic Classification Optimization Framework with Feature Selection, Preprocessing, Manifold Learning and Classifiers

Author(s):  
Fabian Bürger ◽  
Josef Pauli
2020 ◽  
Vol 21 (S13) ◽  
Author(s):  
Ke Li ◽  
Sijia Zhang ◽  
Di Yan ◽  
Yannan Bin ◽  
Junfeng Xia

Abstract
Background: Identification of hot spots in protein-DNA interfaces provides crucial information for research on protein-DNA interactions and drug design. Because experimental methods for determining hot spots are time-consuming, labor-intensive, and expensive, there is a need for reliable computational methods to predict hot spots on a large scale.
Results: Here, we propose a new method named sxPDH, based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost), to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of protein sequence, structure, network, and solvent accessibility information, and systematically assessed various feature selection methods and manifold-learning-based dimensionality reduction methods. The results show that S-ISOMAP is superior to the other feature selection and manifold learning methods. XGBoost was then used to build the hot spot prediction model sxPDH on the three dimensionality-reduced features obtained from S-ISOMAP.
Conclusion: Our method sxPDH boosts prediction performance by combining S-ISOMAP and XGBoost. The model achieves an AUC of 0.773 and an F1 score of 0.713. Experimental results on the benchmark dataset indicate that sxPDH generally outperforms state-of-the-art methods in predicting hot spots.
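As a rough illustration of the pipeline this abstract describes (manifold-learning dimensionality reduction followed by an XGBoost classifier), the sketch below chains the two steps. Note the assumptions: scikit-learn's plain, unsupervised Isomap stands in for S-ISOMAP (which scikit-learn does not provide), and the arrays X and y are random placeholders rather than the 114 protein-DNA features and hot spot labels used in the paper.

```python
# Minimal sketch of an S-ISOMAP + XGBoost style pipeline (assumptions noted above).
import numpy as np
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 114))     # placeholder for 114 sequence/structure/network features
y = rng.integers(0, 2, size=400)    # placeholder labels: 1 = hot spot, 0 = non-hot spot

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Reduce the 114-dimensional feature space to 3 dimensions
# (the paper uses supervised ISOMAP; plain Isomap is used here as a stand-in).
iso = Isomap(n_neighbors=10, n_components=3)
Z_train = iso.fit_transform(X_train)
Z_test = iso.transform(X_test)

# Train an XGBoost classifier on the dimensionality-reduced features.
clf = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, eval_metric="logloss")
clf.fit(Z_train, y_train)

prob = clf.predict_proba(Z_test)[:, 1]
pred = (prob >= 0.5).astype(int)
print("AUC:", roc_auc_score(y_test, prob), "F1:", f1_score(y_test, pred))
```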


Author(s):  
Jia Zhang ◽  
Yidong Lin ◽  
Min Jiang ◽  
Shaozi Li ◽  
Yong Tang ◽  
...  

Information-theoretic methods have attracted great attention in recent years and have achieved promising results in dealing with high-dimensional multi-label data. However, most existing methods are either directly adapted from heuristic single-label feature selection methods or inefficient in exploiting label information. Thus, they may not obtain an optimal feature selection result shared by multiple labels. In this paper, we propose a general global optimization framework in which feature relevance, label relevance (i.e., label correlation), and feature redundancy are all taken into account, thus facilitating multi-label feature selection. Moreover, the proposed method has an effective mechanism for exploiting the inherent properties of multi-label learning. Specifically, we provide a formulation that extends the proposed method with label-specific features. Empirical studies on twenty multi-label data sets demonstrate the effectiveness and efficiency of the proposed method. Our implementation of the proposed method is available online at: https://jiazhang-ml.pub/GRRO-master.zip.
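The abstract does not spell out the objective itself; as an illustration of the kind of information-theoretic trade-off it describes (relevance of a feature to all labels balanced against redundancy with already selected features), the sketch below implements a simple greedy, mRMR-style selector for multi-label data. The function select_features and the random X, Y data are illustrative assumptions, not the authors' GRRO implementation (which is available at the URL above).

```python
# Greedy multi-label feature selection sketch: relevance to all labels minus
# redundancy with previously selected features (a simplified stand-in, not GRRO).
import numpy as np
from sklearn.metrics import mutual_info_score

def select_features(X, Y, k):
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    # Relevance: sum of mutual information between each feature and every label.
    relevance = np.array([sum(mutual_info_score(X[:, j], Y[:, l]) for l in range(Y.shape[1]))
                          for j in range(n_features)])
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            # Redundancy: average mutual information with already selected features.
            red = (np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
                   if selected else 0.0)
            score = relevance[j] - red
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with random discretised data.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 30))   # 30 discretised features
Y = rng.integers(0, 2, size=(200, 4))    # 4 binary labels
print(select_features(X, Y, k=5))
```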


Author(s):  
Yao Zhang ◽  
Yingcang Ma ◽  
Xiaofei Yang

Like traditional single-label learning, multi-label learning also faces the curse of dimensionality. Feature selection is an effective technique for reducing dimensionality and improving learning efficiency on high-dimensional data. In this paper, logistic regression, manifold learning, and sparse regularization are combined into a joint framework for multi-label feature selection (LMFS). First, the sparsity of the feature weight matrix is constrained by the $L_{2,1}$-norm. Second, feature-manifold and label-manifold regularization constrain the feature weight matrix so that it better fits the data and label information. An iterative updating algorithm is designed, and its convergence is proved. Finally, the LMFS algorithm is compared with DRMFS, SCLS, and other algorithms on eight classical multi-label data sets. The experimental results show the effectiveness of the LMFS algorithm.
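The abstract does not give the exact LMFS objective, but a joint formulation of the kind described (logistic loss, manifold regularization on the feature weights, and $L_{2,1}$ sparsity) could plausibly take a form such as the following sketch, where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $Y \in \{-1,+1\}^{n \times q}$ the label matrix, $W \in \mathbb{R}^{d \times q}$ the feature weight matrix, $L_F$ and $L_Y$ graph Laplacians built from feature and label similarities, and $\alpha, \beta, \gamma$ trade-off parameters (all notation here is assumed, not taken from the paper):

$$
\min_{W}\ \sum_{i=1}^{n}\sum_{k=1}^{q}\log\!\bigl(1+\exp(-Y_{ik}\,x_i^{\top}w_k)\bigr)
\;+\;\alpha\,\operatorname{tr}\!\bigl(W^{\top}L_F W\bigr)
\;+\;\beta\,\operatorname{tr}\!\bigl((XW)^{\top}L_Y\,(XW)\bigr)
\;+\;\gamma\,\lVert W\rVert_{2,1}
$$

Here $\lVert W\rVert_{2,1}=\sum_{j=1}^{d}\lVert w^{j}\rVert_2$ (the sum of row norms) drives whole rows of $W$, and hence whole features, toward zero, which is what performs the feature selection.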

