Robust biomarker discovery for hepatocellular carcinoma from high-throughput data by multiple feature selection methods

2021 ◽  
Vol 14 (S1) ◽  
Author(s):  
Zishuang Zhang ◽  
Zhi-Ping Liu

Abstract Background Hepatocellular carcinoma (HCC) is one of the most common cancers. The discovery of specific genes serving as biomarkers is of paramount significance for cancer diagnosis and prognosis. The high-throughput omics data generated by The Cancer Genome Atlas (TCGA) consortium provide a valuable resource for the discovery of HCC biomarker genes. Numerous methods have been proposed to select cancer biomarkers. However, these methods have not investigated the robustness of identification with different feature selection techniques. Methods We use six different recursive feature elimination methods to select the gene signatures of HCC from TCGA liver cancer data. The genes shared by the six selected subsets are proposed as robust biomarkers. The Akaike information criterion (AIC) is employed to explain the optimization process of feature selection, which provides a statistical interpretation for feature selection in machine learning methods. We also use several methods to validate the screened biomarkers. Results In this paper, we propose a robust method for discovering biomarker genes for HCC from gene expression data. Specifically, we implement recursive feature elimination with cross-validation (RFE-CV) based on six different classification algorithms. The overlaps among the gene sets discovered by the different methods are taken as the identified biomarkers. We give an interpretation of the machine-learning-based feature selection process using the AIC from statistics. Furthermore, the features selected by backward stepwise logistic regression under the minimum-AIC criterion are completely contained in the identified biomarkers. The classification results verify the superiority of the interpretable robust biomarker discovery method. Conclusions It is found that overlaps among gene subsets contain different quantitative features selected by the RFE-CV of 6 classifiers. 
The AIC values in the model selection provide a theoretical foundation for the feature selection process of biomarker discovery via machine learning. Moreover, genes contained in more of the optimally selected subsets make better biological sense. The quality of feature selection is improved by intersecting the biomarkers selected by different classifiers. This is a general method suitable for screening biomarkers of complex diseases from high-throughput data.
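
The two ideas in this abstract, intersecting the gene subsets chosen by several selectors and comparing models by AIC, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the selector names, the gene identifiers (`G1` to `G5`) and the log-likelihood values are all hypothetical placeholders, and the RFE-CV step that would produce the subsets is assumed to have run already.

```python
from collections import Counter


def aic(log_likelihood, n_params):
    """Akaike information criterion, AIC = 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


def robust_biomarkers(selected):
    """Return genes shared by every subset, plus how often each gene was picked."""
    subsets = [set(genes) for genes in selected.values()]
    shared = set.intersection(*subsets) if subsets else set()
    counts = Counter(gene for subset in subsets for gene in subset)
    return shared, counts


# Hypothetical output of three feature selectors (placeholder gene names).
selected = {
    "svm": {"G1", "G2", "G3", "G4"},
    "logistic": {"G1", "G2", "G5"},
    "random_forest": {"G1", "G2", "G3"},
}
shared, counts = robust_biomarkers(selected)
print(sorted(shared))   # ['G1', 'G2']
print(counts["G3"])     # 2 (picked by two of the three selectors)

# Hypothetical log-likelihoods: the smaller model wins despite a slightly worse fit.
print(aic(-120.0, 5), aic(-119.0, 8))   # 250.0 254.0
```

The per-gene counts make it possible to rank genes by how many selectors agree on them, which mirrors the abstract's remark that genes appearing in more subsets are more meaningful.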

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
Abdullah Feroze ◽  
Eric C Holland ◽  
Linda Shapiro ◽  
...  

Abstract Background Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
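
The headline metric here is an area under the ROC curve. As a reminder of what that number measures, the sketch below computes AUC as the fraction of (positive, negative) pairs that a classifier ranks correctly, counting ties as half. The labels and scores are made-up illustration values, not data from the study, and the cross-validation loop that would produce them is not shown.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical subtype labels (1/0) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))   # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported 0.80 sits between the two.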


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Ze-ying Wu ◽  
Zhong-da Zeng ◽  
Zi-dan Xiao ◽  
Daniel Kam-Wah Mok ◽  
Yi-zeng Liang ◽  
...  

The rapid increase in the use of metabolite profiling/fingerprinting techniques to resolve complicated issues in metabolomics has stimulated demand for data processing techniques, such as alignment, to extract detailed information. In this study, a new and automated method was developed to correct the retention time shift of high-dimensional and high-throughput data sets. Information from the target chromatographic profiles was used to determine the standard profile as a reference for alignment. A novel, piecewise data partition strategy was applied for the determination of the target components in the standard profile as markers for alignment. An automated target search (ATS) method was proposed to find the exact retention times of the selected targets in other profiles for alignment. The linear interpolation technique (LIT) was employed to align the profiles prior to pattern recognition, comprehensive comparison analysis, and other data processing steps. In total, 94 metabolite profiles of ginseng were studied, including the most volatile secondary metabolites. The method used in this article could be an essential step in the extraction of information from high-throughput data acquired in the study of systems biology, metabolomics, and biomarker discovery.
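
The alignment step can be pictured as a piecewise-linear mapping between matched marker peaks. The sketch below is one plausible reading of linear-interpolation alignment, not the published ATS/LIT implementation: it assumes the marker retention times have already been located in both the sample and the reference profile, and all numeric values are invented for illustration.

```python
from bisect import bisect_right


def align_time(t, sample_markers, reference_markers):
    """Map retention time t from the sample axis onto the reference axis.

    Both marker lists must be sorted and matched one to one. Times outside
    the marker range are extrapolated from the nearest segment.
    """
    if len(sample_markers) != len(reference_markers) or len(sample_markers) < 2:
        raise ValueError("need at least two matched markers")
    # Index of the segment containing t, clamped to the valid range.
    i = bisect_right(sample_markers, t) - 1
    i = max(0, min(i, len(sample_markers) - 2))
    s0, s1 = sample_markers[i], sample_markers[i + 1]
    r0, r1 = reference_markers[i], reference_markers[i + 1]
    return r0 + (t - s0) * (r1 - r0) / (s1 - s0)


# Hypothetical marker retention times (minutes) in a sample and the reference.
sample_markers = [2.0, 6.0, 10.0]
reference_markers = [2.5, 6.0, 9.5]
print(align_time(4.0, sample_markers, reference_markers))   # 4.25
print(align_time(8.0, sample_markers, reference_markers))   # 7.75
```

Applying `align_time` to every point of a profile's time axis puts all profiles on the reference axis before downstream pattern recognition.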


2021 ◽  
Author(s):  
Isaac Shiri ◽  
Yazdan Salimi ◽  
Abdollah Saberi ◽  
Masoumeh Pakbin ◽  
Ghasem Hajianfar ◽  
...  

Abstract Purpose To derive and validate an effective radiomics-based model for differentiation of COVID-19 pneumonia from other lung diseases using a very large cohort of patients. Methods We collected 19 private and 5 public datasets, accumulating to 26,307 individual patient images (15,148 COVID-19; 9,657 with other lung diseases, e.g. non-COVID-19 pneumonia, lung cancer, pulmonary embolism; 1,502 normal cases). Images were automatically segmented using a validated deep learning (DL) model and the results carefully reviewed. Images were first cropped into lung-only region boxes, then resized to 296×216 voxels. Voxel dimensions were resized to 1×1×1 mm³, followed by 64-bin discretization. The 108 extracted features included shape, first-order histogram and texture features. Univariate analysis was first performed using simple logistic regression. The thresholds were fixed in the training set and evaluation was then performed on the test set. False discovery rate (FDR) correction was applied to the p-values. Z-score normalization was applied to all features. For multivariate analysis, features with high correlation (R² > 0.99) were eliminated first using Pearson correlation. We tested 96 different machine learning strategies through cross-combining 4 feature selectors or 8 dimensionality reduction techniques with 8 classifiers. We trained and evaluated our models using 3 different datasets: 1) the entire dataset (26,307 patients: 15,148 COVID-19; 11,159 non-COVID-19); 2) excluding normal patients in non-COVID-19, and including only RT-PCR positive COVID-19 cases in the COVID-19 class (20,697 patients: 12,419 COVID-19 and 8,278 non-COVID-19); 3) including only non-COVID-19 pneumonia patients and a random sample of COVID-19 patients (5,582 patients: 3,000 COVID-19 and 2,582 non-COVID-19) to provide balanced classes. Subsequently, each of these 3 datasets was randomly split into 70% and 30% for training and testing, respectively. 
All steps, including feature preprocessing, feature selection, and classification, were performed separately in each dataset. Classification algorithms were optimized during training using grid search. The best models were chosen by a one-standard-deviation rule in 10-fold cross-validation and were then evaluated on the test sets. Results In dataset #1, the combination of Relief feature selection and a Random Forest (RF) classifier resulted in the highest performance (area under the receiver operating characteristic curve (AUC) = 0.99, sensitivity = 0.98, specificity = 0.94, accuracy = 0.96, positive predictive value (PPV) = 0.96, and negative predictive value (NPV) = 0.96). In dataset #2, the combination of Recursive Feature Elimination (RFE) feature selection and the RF classifier resulted in the highest performance (AUC = 0.99, sensitivity = 0.98, specificity = 0.95, accuracy = 0.97, PPV = 0.96, and NPV = 0.98). In dataset #3, the combination of ANOVA feature selection and the RF classifier resulted in the highest performance (AUC = 0.98, sensitivity = 0.96, specificity = 0.93, accuracy = 0.94, PPV = 0.93, NPV = 0.96). Conclusion Radiomic features extracted from the entire lung, combined with machine learning algorithms, can enable very effective, routine diagnosis of COVID-19 pneumonia from CT images without the use of any other diagnostic test.
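
The redundancy filter described in the methods (dropping features whose squared Pearson correlation with another feature exceeds 0.99) might look like the sketch below. The greedy keep-first order is an assumption, since the abstract does not say which member of a correlated pair is retained, and the feature names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.99):
    """Greedily keep features whose R^2 with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(pearson(values, features[k]) ** 2 <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical radiomic features; f2 is an exact multiple of f1.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(features))   # ['f1', 'f3']
```

Constant features would cause a division by zero in `pearson`, so in practice they would need to be removed before this step.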


2022 ◽  
Author(s):  
Sahan M. Vijithananda ◽  
Mohan L. Jayatilake ◽  
Badra Hewavithana ◽  
Teresa Gonçalves ◽  
Luis M. Rato ◽  
...  

Abstract Background: Diffusion-weighted (DW) imaging is a well-recognized magnetic resonance imaging (MRI) technique that is routinely used in brain examinations in modern clinical radiology practice. This study focuses on extracting demographic and texture features from MRI Apparent Diffusion Coefficient (ADC) images of human brain tumors, identifying the distribution patterns of each feature, and applying Machine Learning (ML) techniques to differentiate malignant from benign brain tumors. Methods: This prospective study was carried out using 1599 labeled MRI brain ADC image slices (995 malignant, 604 benign) from 195 patients who were radiologically diagnosed and histopathologically confirmed as brain tumor patients. The demographics, mean pixel values, skewness, kurtosis, features of Grey Level Co-occurrence Matrix (GLCM), mean, variance, energy, entropy, contrast, homogeneity, correlation, prominence and shade, were extracted from the MRI ADC images of each patient. At the feature selection phase, the validity of the extracted features was measured using the ANOVA F-test. Then, these features were used as input to several Machine Learning classification algorithms and the respective models were assessed. Results: According to the results of the ANOVA F-test feature selection process, two attributes, skewness (3.34) and GLCM homogeneity (3.45), received the lowest ANOVA F-test scores. Both features were therefore excluded from the rest of the experiment. Of the different ML algorithms tested, the Random Forest classifier was chosen to build the final ML model since it presented the highest accuracy. The final model was able to predict malignant and benign neoplasms with 90.41% accuracy after the hyperparameter tuning process. Conclusion: This study concludes that the above-mentioned features (except skewness and GLCM homogeneity) are informative for identifying and differentiating malignant from benign brain tumors. 
Moreover, they enable the development of a high-performance ML model that can assist in the decision-making steps of the brain tumor diagnosis process, prior to attempting invasive diagnostic procedures such as brain biopsies.
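
The ANOVA F-test score used for feature ranking is the ratio of between-class to within-class variance of a single feature. The sketch below computes that statistic for one feature split by class. The two value lists are hypothetical and unrelated to the study's data, and the study's own tooling is not stated in the abstract.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature, given its values per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical values of one texture feature in benign and malignant slices.
benign = [1.0, 2.0, 3.0]
malignant = [2.0, 3.0, 4.0]
print(anova_f([benign, malignant]))   # 1.5
```

A low F score, as reported for skewness and GLCM homogeneity, means the class means differ little relative to the spread within each class.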


2019 ◽  
Vol 12 (4) ◽  
pp. 317-328 ◽  
Author(s):  
Rajalakshmi Krishnamurthi ◽  
Niyati Aggrawal ◽  
Lokendra Sharma ◽  
Diva Srivastava ◽  
Shivangi Sharma

Background: Breast cancer is one of the most common forms of cancer among women and the leading cause of death among them. Countries like the United States, England and Canada have reported a high number of breast cancer patients every year and this number is continuously increasing due to detection at later stages. Hence, it is very important to create awareness among women and to develop algorithms that help to detect malignant cancer. Several research studies have been conducted to analyze breast cancer data. Objective: This paper presents an effective method for predicting breast cancer and its stage and also analyzes the performance of different supervised learning algorithms, such as Random Classifier and the Chi2 Square test, used for prediction. The paper focuses on three important aspects: feature selection, the corresponding data visualisation, and finally making a prediction call with different machine learning models. Methods: The dataset used for this work is the breast cancer Wisconsin data taken from the UCI library. The dataset has been used to show the different 32 features which are all important and how it can be achieved using data visualisation. Secondly, after the feature selection, different machine learning models have been applied. Conclusion: The machine learning models involved are Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Random Forest, Principal Component Analysis (PCA), and Neural Network using Perceptron (NNP). This has been done to check which type of model is better under what conditions. At different stages several charts have been plotted and eliminated based on relative comparison. Results have shown that Random Tree classifier along with Chi2 Square proves to be an efficient one.
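
Chi-square feature scoring, as mentioned above, can be illustrated with one common formulation for non-negative features: compare the per-class sum of a feature against the sum expected if the feature were independent of the class. The abstract does not specify the exact variant used, so this is an assumption, and the values and labels below are invented.

```python
def chi2_score(values, labels):
    """Chi-square score of one non-negative feature against class labels."""
    total = sum(values)
    n = len(values)
    score = 0.0
    for cls in set(labels):
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        # Expected class sum if the feature were independent of the label.
        expected = total * sum(1 for y in labels if y == cls) / n
        score += (observed - expected) ** 2 / expected
    return score


# Hypothetical feature values for four tumours.
labels = ["benign", "benign", "malignant", "malignant"]
informative = [1.0, 2.0, 3.0, 4.0]
uninformative = [2.0, 2.0, 2.0, 2.0]
print(round(chi2_score(informative, labels), 6))   # 1.6
print(chi2_score(uninformative, labels))           # 0.0
```

Features are then ranked by score and the top ones passed to the classifier.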


2021 ◽  
Vol 13 (14) ◽  
pp. 2833
Author(s):  
Xing Wei ◽  
Marcela A. Johnson ◽  
David B. Langston ◽  
Hillary L. Mehl ◽  
Song Li

Hyperspectral sensors combined with machine learning are increasingly utilized in agricultural crop systems for diverse applications, including plant disease detection. This study was designed to identify the most important wavelengths to discriminate between healthy and diseased peanut (Arachis hypogaea L.) plants infected with Athelia rolfsii, the causal agent of peanut stem rot, using in-situ spectroscopy and machine learning. In greenhouse experiments, daily measurements were conducted to inspect disease symptoms visually and to collect spectral reflectance of peanut leaves on lateral stems of plants mock-inoculated and inoculated with A. rolfsii. Spectrum files were categorized into five classes based on foliar wilting symptoms. Five feature selection methods were compared to select the top 10 ranked wavelengths with and without a custom minimum distance of 20 nm. Recursive feature elimination methods outperformed the chi-square and SelectFromModel methods. Adding the minimum distance of 20 nm into the top selected wavelengths improved classification performance. Wavelengths of 501–505, 690–694, 763 and 884 nm were repeatedly selected by two or more feature selection methods. These selected wavelengths can be applied in designing optical sensors for automated stem rot detection in peanut fields. The machine-learning-based methodology can be adapted to identify spectral signatures of disease in other plant-pathogen systems.
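
Selecting top-ranked wavelengths subject to a minimum spacing can be done greedily: walk down the ranking and skip any wavelength closer than the minimum distance to one already chosen. The sketch below is an assumed implementation of that idea, since the abstract does not describe how the 20 nm constraint was enforced, and the importance scores are hypothetical.

```python
def select_with_min_distance(scores, k=10, min_distance=20):
    """Pick up to k wavelengths (nm) by descending score, at least min_distance apart."""
    chosen = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for wavelength, _ in ranked:
        if all(abs(wavelength - c) >= min_distance for c in chosen):
            chosen.append(wavelength)
        if len(chosen) == k:
            break
    return chosen


# Hypothetical importance scores keyed by wavelength in nm.
scores = {501: 0.90, 505: 0.85, 690: 0.80, 694: 0.70, 763: 0.60, 884: 0.50}
print(select_with_min_distance(scores, k=3))                   # [501, 690, 763]
print(select_with_min_distance(scores, k=3, min_distance=0))   # [501, 505, 690]
```

With the spacing constraint the selection covers three distinct spectral regions instead of spending two slots on neighbouring bands around 501 to 505 nm.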


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254720
Author(s):  
Maritza Mera-Gaona ◽  
Ursula Neumann ◽  
Rubiel Vargas-Canas ◽  
Diego M. López

Handling missing values is a crucial step in preprocessing data in Machine Learning. Most available algorithms for analyzing datasets in the feature selection process and classification or estimation process analyze complete datasets. Consequently, in many cases, the strategy for dealing with missing values is to use only instances with full data or to replace missing values with a mean, mode, median, or a constant value. Usually, discarding missing samples or replacing missing values by means of fundamental techniques causes bias in subsequent analyses of datasets. Aim: Demonstrate the positive impact of multivariate imputation in the feature selection process on datasets with missing values. Results: We compared the effects of the feature selection process using complete datasets, incomplete datasets with missingness rates between 5 and 50%, and imputed datasets by basic techniques and multivariate imputation. The feature selection algorithms used are well-known methods. The results showed that the datasets imputed by multivariate imputation obtained the best results in feature selection compared to datasets imputed by basic techniques or non-imputed incomplete datasets. Conclusions: Considering the results obtained in the evaluation, applying multivariate imputation by MICE reduces bias in the feature selection process.
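
The contrast between basic and multivariate imputation can be seen on a toy example. The sketch below compares mean imputation with a single regression-imputation step that predicts the missing value from another feature. This is a deliberate simplification: full MICE cycles such regressions over every incomplete variable for several iterations, which is not shown here, and the data are invented.

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def regression_impute(target, predictor):
    """Replace missing target entries with a least-squares prediction from predictor."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, target)]


# Hypothetical data: target is exactly twice the predictor, last value missing.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [2.0, 4.0, 6.0, 8.0, None]
print(mean_impute(target)[-1])                    # 5.0
print(regression_impute(target, predictor)[-1])   # 10.0
```

Mean imputation pulls the missing value toward the centre of the observed data and weakens the relationship between the two features, which is the kind of bias the abstract attributes to basic techniques.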

