An empirical study on optimization of training dataset in harmfulness prediction of code clone using ensemble feature selection model

Author(s):  
Sheng Yan ◽  
Liping Zhang ◽  
Dongsheng Liu
2021 ◽  
Vol 113 ◽  
pp. 107870
Author(s):  
Md Alamgir Kabir ◽  
Jacky Keung ◽  
Burak Turhan ◽  
Kwabena Ebo Bennin

2021 ◽  
Vol 11 ◽  
Author(s):  
Hongwei Yu ◽  
Xianqi Meng ◽  
Huang Chen ◽  
Jian Liu ◽  
Wenwen Gao ◽  
...  

Objectives: This study aimed to investigate whether radiomics classifiers from mammography can help predict tumor-infiltrating lymphocyte (TIL) levels in breast cancer. Methods: Data from 121 consecutive patients with pathologically-proven breast cancer who underwent preoperative mammography from February 2018 to May 2019 were retrospectively analyzed. Patients were randomly divided into a training dataset (n = 85) and a validation dataset (n = 36). A total of 612 quantitative radiomics features were extracted from mammograms using the Pyradiomics software. Radiomics features were selected and the radiomics classifier was built using recursive feature elimination and a logistic regression model. The relationship between radiomics features and TIL levels in breast cancer patients was explored. The predictive capacity of the radiomics classifiers for the TIL levels was investigated through receiver operating characteristic curves in the training and validation groups. A radiomics score (Rad score) was generated using logistic regression for the training and validation datasets, combined with the Mann–Whitney U test to evaluate the level of TILs in the low and high groups. Results: Among the 121 patients, 32 (26.44%) exhibited high TIL levels, and 89 (73.56%) showed low TIL levels. The ER negativity (p = 0.01) and the Ki-67 negative threshold level (p = 0.03) in the low TIL group were higher than those in the high TIL group. Through the radiomics feature selection, six top-class features [wavelet GLDM low gray-level emphasis (mediolateral oblique, MLO), GLRLM short-run low gray-level emphasis (craniocaudal, CC), LBP2D GLRLM short-run high gray-level emphasis (CC), LBP2D GLDM dependence entropy (MLO), wavelet interquartile range (MLO), and LBP2D median (MLO)] were selected to constitute the radiomics classifiers. The radiomics classifier had an excellent predictive performance for TIL levels both in the training and validation sets [area under the curve (AUC): 0.83, 95% confidence interval (CI), 0.738–0.917, with positive predictive value (PPV) of 0.913; AUC: 0.79, 95% CI, 0.615–0.964, with PPV of 0.889, respectively]. Moreover, the Rad score in the training dataset was higher than that in the validation dataset (p = 0.007 and p = 0.001, respectively). Conclusion: Radiomics from digital mammograms not only predicts the TIL levels in breast cancer patients, but can also serve as non-invasive biomarkers in precision medicine, allowing for the development of treatment plans.
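The AUC and PPV figures reported above can be made concrete with a small sketch. It is not the study's code: the scores and labels are made-up toy values, and the function names are hypothetical. It relies only on the standard identity that the AUC equals the fraction of (positive, negative) pairs that the score ranks correctly, which is the Mann–Whitney U statistic divided by the number of such pairs.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win; this equals U / (n_pos * n_neg).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def positive_predictive_value(scores, labels, threshold):
    """Share of cases scored at or above the threshold that are truly positive."""
    flagged = [l for s, l in zip(scores, labels) if s >= threshold]
    if not flagged:
        raise ValueError("no case reaches the threshold")
    return sum(flagged) / len(flagged)


# Toy data (assumption): 1 = high TIL, 0 = low TIL, scores play the role of a Rad score.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 0, 1, 0, 0]

auc = roc_auc(toy_scores, toy_labels)                            # 5 of 6 pairs ranked correctly
ppv = positive_predictive_value(toy_scores, toy_labels, 0.5)     # 1 of 2 flagged cases is positive
```

The pairwise form is quadratic in the number of cases, which is fine at the scale of a 121-patient cohort; a rank-based computation would be the usual choice for larger data.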


2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. Thus, this paper proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high dimensional text classification. This technique is able to measure a feature's importance in a high-dimensional text document. In addition, it aims to increase the efficiency of feature selection and thereby obtain promising text classification accuracy. TF-IDF acts as a filter approach that measures the importance of features in the text documents at the first stage. SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research runs sets of experiments using a text dataset retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-ranked features are selected for text classification using the SVM classifier. The experiments show that the proposed technique achieves 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high dimensional text documents.
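The two-stage idea described above (a TF-IDF filter followed by recursive backward elimination on a linear model's weights) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the four-document corpus is invented, the function names are hypothetical, the TF-IDF variant is the plain tf × ln(N/df) form summed over documents, and a simple perceptron stands in for the SVM so the example needs only the standard library.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {term: TF-IDF summed over all tokenized documents}."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            scores[term] += (count / len(doc)) * math.log(n_docs / doc_freq[term])
    return dict(scores)


def filter_top_terms(docs, k):
    """Stage 1 (filter): keep the k terms with the highest summed TF-IDF."""
    scores = tfidf_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def train_perceptron(X, y, epochs=20):
    """Fit linear weights with labels in {-1, +1}; a stand-in for a linear SVM."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w


def recursive_elimination(docs, labels, terms, n_keep):
    """Stage 2 (RFE): retrain, drop the term with the smallest |weight|, repeat."""
    terms = list(terms)
    while len(terms) > n_keep:
        X = [[doc.count(t) for t in terms] for doc in docs]
        w = train_perceptron(X, labels)
        weakest = min(range(len(terms)), key=lambda j: (abs(w[j]), terms[j]))
        del terms[weakest]
    return terms


# Toy corpus (assumption): two positive and two negative posts.
docs = [["the", "great", "phone"], ["the", "great", "camera"],
        ["the", "bad", "phone"], ["the", "bad", "battery"]]
labels = [1, 1, -1, -1]

filtered = filter_top_terms(docs, 5)                       # "the" occurs everywhere, so its IDF is 0
selected = recursive_elimination(docs, labels, filtered, 2)
```

On this toy corpus the filter discards the term that appears in every document, and the elimination loop then removes terms that carry no weight in the linear model. A real pipeline would fit both stages on the training split only, as the abstract describes.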


2009 ◽  
Vol 07 (05) ◽  
pp. 773-788 ◽  
Author(s):  
PENG CHEN ◽  
CHUNMEI LIU ◽  
LEGAND BURGE ◽  
MOHAMMAD MAHMOOD ◽  
WILLIAM SOUTHERLAND ◽  
...  

Protein fold classification is a key step to predicting protein tertiary structures. This paper proposes a novel approach based on genetic algorithms and feature selection to classifying protein folds. Our dataset is divided into a training dataset and a test dataset. Each individual for the genetic algorithms represents a selection function of the feature vectors of the training dataset. A support vector machine is applied to each individual to evaluate the fitness value (fold classification rate) of each individual. The aim of the genetic algorithms is to search for the best individual that produces the highest fold classification rate. The best individual is then applied to the feature vectors of the test dataset and a support vector machine is built to classify protein folds based on selected features. Our experimental results on Ding and Dubchak's benchmark dataset of 27-class folds show that our approach achieves an accuracy of 71.28%, which outperforms current state-of-the-art protein fold predictors.
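The search loop described above, in which each individual encodes a feature subset and its fitness is a classification rate, can be sketched as follows. This is a minimal illustration, not the paper's method: the data are synthetic, the names and hyperparameters are hypothetical, and leave-one-out 1-nearest-neighbour accuracy stands in for the support vector machine so the example runs with the standard library alone.

```python
import random


def loo_1nn_accuracy(X, y, mask):
    """Fitness: leave-one-out 1-NN accuracy using only features where mask is 1."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for k in range(len(X)):
            if k == i:
                continue
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in cols)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def ga_select(X, y, pop_size=12, generations=15, mutation_rate=0.1, seed=0):
    """Evolve binary feature masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n_features = len(X[0])

    def fitness(mask):
        return loo_1nn_accuracy(X, y, mask)

    # Seed the population with the all-features mask plus random masks.
    population = [[1] * n_features]
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(next_gen) < pop_size:
            p1, p2 = rng.sample(ranked[:pop_size // 2], 2)   # parents from the top half
            cut = rng.randrange(1, n_features)               # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best)


# Synthetic data (assumption): feature 0 tracks the label, features 1-5 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(30)]
X = [[label + data_rng.gauss(0, 0.2)] + [data_rng.gauss(0, 1) for _ in range(5)]
     for label in y]

best_mask, best_fit = ga_select(X, y)
```

Because the all-features mask is in the initial population and the best individuals are carried over each generation, the returned fitness can never be worse than using every feature. As in the abstract, the selected mask would then be applied to a separate test set for the final evaluation.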

