Multi-label feature selection based on logistic regression and manifold learning

Author(s):  
Yao Zhang ◽  
Yingcang Ma ◽  
Xiaofei Yang

Like traditional single-label learning, multi-label learning faces the curse of dimensionality. Feature selection is an effective technique for reducing dimensionality and improving learning efficiency on high-dimensional data. In this paper, logistic regression, manifold learning, and sparse regularization are combined to construct a joint framework for multi-label feature selection (LMFS). First, the sparsity of the feature weight matrix is constrained by the $L_{2,1}$-norm. Second, the feature manifold and label manifold constrain the feature weight matrix so that it better fits the data information and label information. An iterative updating algorithm is designed, and its convergence is proved. Finally, the LMFS algorithm is compared with DRMFS, SCLS, and other algorithms on eight classical multi-label data sets. The experimental results show the effectiveness of the LMFS algorithm.
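As a rough illustration of the $L_{2,1}$-norm mentioned in the abstract above, the sketch below computes it for a small weight matrix with one row per feature and one column per label, then ranks features by row norm. The matrix values and the helper names `l21_norm` and `rank_features` are hypothetical, and ranking by row norm is the usual convention for $L_{2,1}$-regularized feature selection rather than a detail confirmed by the abstract.

```python
import math

def row_norms(W):
    """Euclidean norm of each row of W (one row per feature)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]

def l21_norm(W):
    """L2,1-norm: the sum of the Euclidean norms of the rows of W."""
    return sum(row_norms(W))

def rank_features(W):
    """Feature indices sorted from largest to smallest row norm."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)

# Hypothetical weight matrix: 3 features (rows) x 2 labels (columns).
W = [
    [6.0, 8.0],  # row norm 10
    [0.0, 0.0],  # row norm 0: a feature the penalty has zeroed out
    [3.0, 4.0],  # row norm 5
]

print(l21_norm(W))       # 15.0
print(rank_features(W))  # [0, 2, 1]
```

Because the penalty sums row norms, it tends to push entire rows to zero, which is why it can be read as selecting or discarding a feature across all labels at once.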


2021 ◽  
Vol 11 ◽  
Author(s):  
Hongwei Yu ◽  
Xianqi Meng ◽  
Huang Chen ◽  
Jian Liu ◽  
Wenwen Gao ◽  
...  

Objectives: This study aimed to investigate whether radiomics classifiers from mammography can help predict tumor-infiltrating lymphocyte (TIL) levels in breast cancer. Methods: Data from 121 consecutive patients with pathologically proven breast cancer who underwent preoperative mammography from February 2018 to May 2019 were retrospectively analyzed. Patients were randomly divided into a training dataset (n = 85) and a validation dataset (n = 36). A total of 612 quantitative radiomics features were extracted from mammograms using the Pyradiomics software. Radiomics features were selected and the radiomics classifier was built using recursive feature elimination and a logistic regression model. The relationship between radiomics features and TIL levels in breast cancer patients was explored. The predictive capacity of the radiomics classifiers for TIL levels was assessed with receiver operating characteristic curves in the training and validation groups. A radiomics score (Rad score) was computed for the training and validation datasets using logistic regression, and the Mann–Whitney U test was used to evaluate the low and high TIL groups. Results: Among the 121 patients, 32 (26.44%) exhibited high TIL levels, and 89 (73.56%) showed low TIL levels. The ER negativity (p = 0.01) and the Ki-67 negative threshold level (p = 0.03) in the low TIL group were higher than those in the high TIL group. Through radiomics feature selection, six top-class features [wavelet GLDM low gray-level emphasis (mediolateral oblique, MLO), GLRLM short-run low gray-level emphasis (craniocaudal, CC), LBP2D GLRLM short-run high gray-level emphasis (CC), LBP2D GLDM dependence entropy (MLO), wavelet interquartile range (MLO), and LBP2D median (MLO)] were selected to constitute the radiomics classifiers. The radiomics classifier had an excellent predictive performance for TIL levels in both the training and validation sets [area under the curve (AUC): 0.83, 95% confidence interval (CI), 0.738–0.917, with positive predictive value (PPV) of 0.913; AUC: 0.79, 95% CI, 0.615–0.964, with PPV of 0.889, respectively]. Moreover, the Rad score in the training dataset was higher than that in the validation dataset (p = 0.007 and p = 0.001, respectively). Conclusion: Radiomics from digital mammograms not only predicts TIL levels in breast cancer patients but can also serve as a non-invasive biomarker in precision medicine, allowing for the development of treatment plans.
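The abstract above reports both an AUC and a Mann–Whitney U test on Rad scores. The two are closely related: the AUC equals the U statistic divided by the number of (positive, negative) pairs. The sketch below shows that relationship on made-up scores; the function name `auc_from_scores` and every score value are hypothetical and are not taken from the study.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as one half.
    The numerator is the Mann-Whitney U statistic for the positives."""
    u = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u / (len(pos_scores) * len(neg_scores))

# Hypothetical Rad scores (illustrative only).
high_til = [0.9, 0.7, 0.4]
low_til = [0.8, 0.3, 0.2, 0.1]

# 10 of the 12 pairs are ordered correctly.
print(auc_from_scores(high_til, low_til))
```

The pairwise loop is quadratic in the number of cases, which is fine for cohorts of this size; a rank-based formula gives the same value for larger data.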


Worldwide, breast cancer is the leading type of cancer in women, accounting for 25% of all cases. Survival rates in developed countries are higher than those in developing countries. This has raised the importance of computer-aided diagnostic methods for early detection of breast cancer, which in turn reduces the death rate. This paper explores the scope of biomarkers that can be used to predict breast cancer from anthropometric data. This experimental study computes and compares various classification models (Binary Logistic Regression, Ball Vector Machine (BVM), C4.5, Partial Least Squares (PLS) for Classification, Classification Tree, Cost-sensitive Classification Tree, Cost-sensitive Decision Tree, Support Vector Machine for Classification, Core Vector Machine, ID3, K-Nearest Neighbor, Linear Discriminant Analysis (LDA), Log-Reg TRIRLS, Multilayer Perceptron (MLP), Multinomial Logistic Regression (MLR), Naïve Bayes (NB), PLS for Discriminant Analysis, PLS for LDA, Random Tree (RT), Support Vector Machine (SVM)) on the UCI Coimbra breast cancer dataset. Feature selection algorithms (Backward Logit, Fisher Filtering, Forward Logit, ReliefF, Stepdisc) are applied to find the minimum set of attributes that achieves better accuracy. To verify the accuracy results, jackknife cross-validation is carried out for the algorithms. The Core Vector Machine classification algorithm outperforms the other nineteen algorithms, with an accuracy of 82.76%, sensitivity of 76.92%, and specificity of 87.50% for the three selected attributes (Age, Glucose, and Resistin) using the ReliefF feature selection algorithm.
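For readers who want to see how accuracy, sensitivity, and specificity relate to one another, the sketch below derives all three from confusion-matrix counts. The abstract does not report a confusion matrix, so the counts used here are hypothetical: they are simply one set of integers that happens to reproduce the three reported percentages, with "positive" assumed to mean a breast cancer patient.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity as fractions in [0, 1]."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,  # all correct / all cases
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical counts (NOT from the paper), chosen only because they
# reproduce the reported 82.76% / 76.92% / 87.50%.
metrics = classification_metrics(tp=10, fn=3, tn=14, fp=2)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Note that accuracy depends on the class balance while sensitivity and specificity do not, which is why the three numbers are usually reported together.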


2019 ◽  
Vol 22 (8) ◽  
pp. 1577-1581
Author(s):  
Mohammed Abdulrazaq Kahya ◽  
Suhaib Abduljabbar Altamir ◽  
Zakariya Yahya Algamal

2020 ◽  
Vol 21 (S13) ◽  
Author(s):  
Ke Li ◽  
Sijia Zhang ◽  
Di Yan ◽  
Yannan Bin ◽  
Junfeng Xia

Background: Identification of hot spots in protein-DNA interfaces provides crucial information for research on protein-DNA interaction and drug design. As experimental methods for determining hot spots are time-consuming, labor-intensive, and expensive, there is a need to develop reliable computational methods to predict hot spots on a large scale. Results: Here, we propose a new method named sxPDH, based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost), to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of protein sequence, structure, network, and solvent accessibility information, and systematically assessed various feature selection methods and manifold-learning-based dimensionality reduction methods. The results show that the S-ISOMAP method is superior to the other feature selection or manifold learning methods. XGBoost was then used to develop the hot spot prediction model sxPDH based on the three dimensionality-reduced features obtained from S-ISOMAP. Conclusion: Our method sxPDH boosts prediction performance using S-ISOMAP and XGBoost. The AUC of the model is 0.773, and the F1 score is 0.713. Experimental results on a benchmark dataset indicate that sxPDH achieves generally better performance in predicting hot spots than state-of-the-art methods.

