ExhauFS: exhaustive search-based feature selection for classification and survival regression

2021 ◽  
Author(s):  
Stepan Nersisyan ◽  
Victor Novosad ◽  
Alexei Galatenko ◽  
Andrey Sokolov ◽  
Grigoriy Bokov ◽  
...  

Motivation: Feature selection is one of the main techniques used to prevent overfitting in machine learning applications. The most straightforward approach to feature selection is exhaustive search: one can go over all possible feature combinations and pick the model with the highest accuracy. This method and its optimizations have been actively used in biomedical research; however, a publicly available implementation is missing. Results: We present ExhauFS, a user-friendly command-line implementation of the exhaustive search approach for classification and survival regression. Aside from the tool description, we included three application examples in the manuscript to comprehensively review the implemented functionality. First, we executed ExhauFS on a toy cervical cancer dataset to illustrate basic concepts. Then, multi-cohort microarray and RNA-seq breast cancer datasets were used to construct gene signatures for 5-year recurrence classification. Finally, Cox survival regression models were used to fit isomiR signatures for overall survival prediction for patients with colorectal cancer. Availability: Source code and documentation of ExhauFS are available on GitHub: https://github.com/s-a-nersisyan/ExhauFS.
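The exhaustive search idea described in this abstract can be illustrated with a minimal sketch. It is not the ExhauFS implementation or its command-line interface: the function names, the nearest-centroid classifier, the leave-one-out scoring, and the toy data are all assumptions chosen only to keep the example self-contained.

```python
from itertools import combinations


def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    correct = 0
    for i in range(len(rows)):
        # Accumulate per-class sums over every sample except the held-out one.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(feature_idx))
            for pos, f in enumerate(feature_idx):
                acc[pos] += row[f]
            counts[label] = counts.get(label, 0) + 1

        held_out = [rows[i][f] for f in feature_idx]

        def squared_distance(label):
            # Squared Euclidean distance from the held-out sample to a class centroid.
            return sum((x - s / counts[label]) ** 2
                       for x, s in zip(held_out, sums[label]))

        predicted = min(sums, key=squared_distance)
        correct += predicted == labels[i]
    return correct / len(rows)


def exhaustive_search(rows, labels, k):
    """Score every k-feature subset and return (best_subset, best_score)."""
    best = None
    for subset in combinations(range(len(rows[0])), k):
        score = loo_accuracy(rows, labels, subset)
        if best is None or score > best[1]:
            best = (subset, score)
    return best


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 do not.
rows = [
    [1.0, 5.0, 0.30],
    [1.2, 1.0, 0.90],
    [0.9, 3.0, 0.10],
    [3.0, 4.8, 0.20],
    [3.1, 1.2, 0.80],
    [2.9, 3.1, 0.15],
]
labels = [0, 0, 0, 1, 1, 1]

best_subset, best_score = exhaustive_search(rows, labels, k=1)
print(best_subset, best_score)
```

Every subset of size k is evaluated, so the cost grows combinatorially with the number of features; this is why practical tools pair the search with pre-filtering or other optimizations.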

2021 ◽  
pp. 153537022199201
Author(s):  
Runmin Li ◽  
Guosheng Wang ◽  
ZhouJie Wu ◽  
HuaGuang Lu ◽  
Gen Li ◽  
...  

High-throughput multi-omics sequencing data have laid a solid foundation for identifying genes associated with cancer prognosis, and multi-omics studies can reveal the mechanisms of cancer occurrence and development from several perspectives. The prognosis of osteosarcoma is still poor, so a genetic marker is needed to predict clinically relevant overall survival. First, RNA-seq data, copy number variation data, and clinical follow-up data were obtained from the Office of Cancer Genomics (OCG Target). Prognosis-associated genes and genes with copy number differences were screened in the training group, and these genes were integrated for feature selection with the least absolute shrinkage and selection operator (Lasso) to identify effective biomarkers. Finally, a gene-based prognosis model was built and validated on the test set and a Gene Expression Omnibus validation set. In total, 512 prognosis-related genes (P < 0.01), 336 copy-number-amplified genes (P < 0.05), and 36 copy-number-deleted genes (P < 0.05) were obtained, and the genes with these genomic variants are closely associated with the mechanisms of tumor occurrence and development. Integrating the genomic variant genes with the prognosis-related genes yielded 10 candidate genes. Six genes (MYC, CHIC2, CCDC152, LYL1, GPR142, and MMP27) were obtained by Lasso feature selection and stepwise multivariate regression, many of which have been reported to be related to tumor progression. Cox regression was used to build a 6-gene signature, which serves as a single prognostic factor for patients with osteosarcoma. In addition, the signature was able to risk-stratify samples in the training group, the test set, and the external validation set. The AUC for five-year survival in the training group and validation set exceeded 0.85, a predictive performance superior to that of existing studies. The 6-gene signature is proposed as a new prognostic marker for assessing the survival of patients with osteosarcoma.


Energies ◽  
2019 ◽  
Vol 12 (3) ◽  
pp. 453 ◽  
Author(s):  
Pere Marti-Puig ◽  
Alejandro Blanco-M ◽  
Juan Cárdenas ◽  
Jordi Cusidó ◽  
Jordi Solé-Casals

It is well known that each year the wind sector loses profit to wind turbine failures and to operation and maintenance costs. Operations related to these actions are therefore crucial for wind farm operators and linked companies. A key point in predicting wind turbine failures from SCADA data is selecting the optimal or near-optimal set of inputs to feed the failure prediction (prognosis) algorithm. Because the number of possible predictors is high (from tens to hundreds), obtaining the optimal set of inputs with exhaustive-search algorithms is not viable in the majority of cases. To tackle this issue, show the viability of prognosis, and select the best set of variables from more than 200 analogous variables recorded at intervals of 5 or 10 min by the wind farm’s SCADA, this paper presents a thorough study of automatic input selection algorithms for wind turbine failure prediction and proposes an exhaustive-search-based quasi-optimal (QO) algorithm, which has been used as a reference. A k-NN classification algorithm is used to evaluate performance. Results showed that the best automatic feature selection method in our case study is conditional mutual information (CMI), while the worst is mutual information feature selection (MIFS). Furthermore, the effect of the number of neighbours (k) is tested. Experiments demonstrate that k = 1 is the best option if the number of features is higher than 3. The experiments carried out in this work use measurements taken over an entire year from the gearbox and transmission systems of Fuhrländer wind turbines.
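The remark that exhaustive search is not viable with hundreds of predictors can be made concrete with a quick count of candidate subsets. The sketch below is only a back-of-the-envelope calculation, not part of the paper's method; the figure of 200 variables is a round number assumed from the abstract's "more than 200", and the subset sizes are arbitrary.

```python
from math import comb

# Assumption: 200 is used as a round stand-in for "more than 200" SCADA variables.
N_VARIABLES = 200

# Number of distinct feature subsets of each size k that a full search must evaluate.
counts = {k: comb(N_VARIABLES, k) for k in (1, 2, 3, 5, 10)}

# Number of non-empty subsets of any size.
total_nonempty = 2 ** N_VARIABLES - 1

for k, n in counts.items():
    print(f"subsets of size {k}: {n}")
print(f"all non-empty subsets: {total_nonempty}")
```

Each subset requires training and evaluating a classifier, so the count grows far faster than any realistic compute budget once k moves beyond a handful of features.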


2012 ◽  
Vol 39 (16) ◽  
pp. 12332-12339 ◽  
Author(s):  
Federico Cismondi ◽  
Abigail L. Horn ◽  
André S. Fialho ◽  
Susana M. Vieira ◽  
Shane R. Reti ◽  
...  

2020 ◽  
Author(s):  
Tiansheng Zhu ◽  
Guo-Bo Chen ◽  
Chunhui Yuan ◽  
Rui Sun ◽  
Fangfei Zhang ◽  
...  

Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. To facilitate the evaluation and correction of batch effects, here we present BatchServer, an open-source, user-friendly, interactive graphical web platform based on R/Shiny for batch effects analysis. In BatchServer we introduced autoComBat, a modified version of ComBat, which is the most widely adopted tool for batch effect correction. BatchServer uses PVCA (Principal Variance Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) for evaluation and visualization of batch effects. We demonstrate its application in multiple proteomics and transcriptomic data sets. BatchServer is provided at https://lifeinfo.shinyapps.io/batchserver/ as a web server. The source codes are freely available at https://github.com/guomics-lab/batch_server.


2021 ◽  
Vol 2129 (1) ◽  
pp. 012022
Author(s):  
Mohamad Faiz Dzulkalnine ◽  
Roselina Sallehuddin ◽  
Yusliza Yussof ◽  
Nor Haizan Mohd Radzi ◽  
Noorfa Haszlinna Binti Mustaffa ◽  
...  

In Malaysia, Colorectal Cancer (CRC) is one of the most common cancers in both men and women. Early detection is crucial: it can significantly increase patients' survival rate, whereas CRC left untreated can lead to death. Given the lack of high-quality CRC data, expert systems and machine learning analysis are burdened with irrelevant features, outliers, and noise, which can reduce classification accuracy. Accordingly, it is essential to find a reliable feature selection method that can identify and remove irrelevant features while being resistant to noise and outliers. In this paper, Fuzzy Principal Component Analysis (FPCA) was tested for the classification of a Malaysian CRC dataset. With the utilization of fuzzy membership in FPCA, the experimental results showed that the proposed method produces higher accuracy than PCA and SVM, by almost 2% and 5% respectively. Empirical results showed that FPCA is a reliable feature selection method that can find the most informative features in the CRC dataset, which could assist medical practitioners in making informed decisions.


2018 ◽  
Author(s):  
Sebastià Franch-Expósito ◽  
Laia Bassaganyas ◽  
Maria Vila-Casadesús ◽  
Eva Hernández-Illán ◽  
Roger Esteban-Fabró ◽  
...  

Somatic copy number alterations (CNAs) are a hallmark of cancer. Although CNA profiles have been established for most human tumor types, their precise role in tumorigenesis as well as their clinical and therapeutic relevance remain largely unclear. Thus, computational and statistical approaches are required to thoroughly define the interplay between CNAs and tumor phenotypes. Here we developed CNApp, a user-friendly web tool that offers sample- and cohort-level computational analyses, allowing a comprehensive and integrative exploration of CNAs with clinical and molecular variables. By using purity-corrected segmented data from multiple genomic platforms, CNApp generates genome-wide profiles, computes CNA scores for broad, focal and global CNA burdens, and uses machine learning-based predictions to classify samples. We applied CNApp to a pan-cancer dataset of 10,635 genomes from TCGA showing that CNA patterns classify cancer types according to their tissue-of-origin, and that broad and focal CNA scores positively correlate in samples with low amounts of whole-chromosome and chromosomal arm-level imbalances. Moreover, using the hepatocellular carcinoma cohort from the TCGA repository, we demonstrate the reliability of the tool in identifying recurrent CNAs, confirming previous results. Finally, we establish machine learning-based models to predict colon cancer molecular subtypes and microsatellite instability based on broad CNA scores and specific genomic imbalances. In summary, CNApp facilitates data-driven research and provides a unique framework for the first time to comprehensively assess CNAs and perform integrative analyses that enable the identification of relevant clinical implications. CNApp is hosted at http://cnapp.bsc.es.

