Comments on: Augmenting the bootstrap to analyze high dimensional genomic data

When data exhibit an imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computational demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.

FIFS: A data mining method for informative marker selection in high dimensional population genomic data

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2017.09.020 ◽

2017 ◽

Vol 90 ◽

pp. 146-154 ◽

Cited By ~ 7

Author(s):

Ioannis Kavakiotis ◽

Patroklos Samaras ◽

Alexandros Triantafyllidis ◽

Ioannis Vlahavas

Keyword(s):

Data Mining ◽

Genomic Data ◽

High Dimensional ◽

Mining Method ◽

Informative Marker ◽

Data Mining Method ◽

Marker Selection ◽

Population Genomic

High-dimensional genomic data bias correction and data integration using MANCIE

Nature Communications ◽

10.1038/ncomms11305 ◽

2016 ◽

Vol 7 (1) ◽

Cited By ~ 20

Author(s):

Chongzhi Zang ◽

Tao Wang ◽

Ke Deng ◽

Bo Li ◽

Sheng’en Hu ◽

...

Keyword(s):

Data Integration ◽

Bias Correction ◽

Genomic Data ◽

High Dimensional

Deep learning for predicting disease status using genomic data

10.7287/peerj.preprints.27123 ◽

2018 ◽

Cited By ~ 1

Author(s):

Qianfan Wu ◽

Adel Boueiz ◽

Alican Bozkurt ◽

Arya Masoomi ◽

Allan Wang ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Rapid Development ◽

Learning Algorithms ◽

Genomic Data ◽

Disease Status ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Learning Approach ◽

Low Dimensional

Predicting disease status for a complex human disease using genomic data is an important, yet challenging, step in personalized medicine. Among many challenges, the so-called curse of dimensionality results in unsatisfactory performance of many state-of-the-art machine learning algorithms. A major recent advance in machine learning is the rapid development of deep learning algorithms that can efficiently extract meaningful features from high-dimensional and complex datasets through a stacked and hierarchical learning process. Deep learning has shown breakthrough performance in several areas including image recognition, natural language processing, and speech recognition. However, the performance of deep learning in predicting disease status using genomic datasets is still not well studied. In this article, we review the four relevant articles that we identified through a thorough literature search. All four articles used auto-encoders to project high-dimensional genomic data to a low-dimensional space and then applied state-of-the-art machine learning algorithms to predict disease status based on the low-dimensional representations. This deep learning approach outperformed existing prediction approaches, such as prediction based on probe-wise screening and prediction based on principal component analysis. The limitations of the current deep learning approach and possible improvements are also discussed.
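
The "project to a low-dimensional space, then classify" pipeline in this abstract can be illustrated with a minimal sketch. It shows the principal component analysis baseline that the abstract names, not the auto-encoder approach of the reviewed articles. The nearest-centroid classifier, the function names, and the synthetic data (60 samples, 2,000 features, 100 of them shifted in one class) are assumptions made only for illustration.

```python
import numpy as np


def pca_project(X_train, X_test, k):
    """Project both sets onto the top-k principal components of X_train."""
    mean = X_train.mean(axis=0)
    # Thin SVD of the centered n x d matrix; inexpensive when n << d.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:k].T  # d x k loading matrix
    return (X_train - mean) @ W, (X_test - mean) @ W


def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Assign each test point to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]


# Synthetic stand-in for genomic data (assumption): n << d, two classes.
rng = np.random.default_rng(0)
n, d, k = 60, 2000, 5
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[y == 1, :100] += 2.0  # the first 100 features carry the class signal

# Hold out 20 samples for evaluation.
idx = rng.permutation(n)
train, test = idx[:40], idx[40:]

Z_train, Z_test = pca_project(X[train], X[test], k)
pred = nearest_centroid_predict(Z_train, y[train], Z_test)
accuracy = (pred == y[test]).mean()
print(f"held-out accuracy with {k} components: {accuracy:.2f}")
```

The projection is fitted on the training samples only, so the held-out samples do not influence the components. An auto-encoder would replace `pca_project` with a learned nonlinear encoder while leaving the rest of the pipeline unchanged.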

Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data

Bioinformatics ◽

10.1093/bioinformatics/bty750 ◽

2018 ◽

Vol 35 (7) ◽

pp. 1181-1187 ◽

Cited By ~ 22

Author(s):

Haohan Wang ◽

Benjamin J Lengerich ◽

Bryon Aragam ◽

Eric P Xing

Keyword(s):

Genomic Data ◽

High Dimensional

Analysis of high-dimensional genomic data employing a novel bio-inspired algorithm

Applied Soft Computing ◽

10.1016/j.asoc.2019.01.007 ◽

2019 ◽

Vol 77 ◽

pp. 520-532 ◽

Cited By ~ 15

Author(s):

Santos Kumar Baliarsingh ◽

Swati Vipsita ◽

Khan Muhammad ◽

Bodhisattva Dash ◽

Sambit Bakshi

Keyword(s):

Genomic Data ◽

High Dimensional

A Review on Methods for Detecting SNP Interactions in High-Dimensional Genomic Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2016.2635125 ◽

2018 ◽

Vol 15 (2) ◽

pp. 599-612 ◽

Cited By ~ 18

Author(s):

Suneetha Uppu ◽

Aneesh Krishna ◽

Raj P. Gopalan

Keyword(s):

Genomic Data ◽

High Dimensional

Stable Variable Selection for High-dimensional Genomic Data with Strong Correlations

10.21203/rs.3.rs-923319/v1 ◽

2021 ◽

Author(s):

Reetika Sarkar ◽

Sithija Manage ◽

Xiaoli Gao

Keyword(s):

Variable Selection ◽

Genomic Data ◽

High Dimensional ◽

Strong Correlations ◽

Computationally Efficient ◽

Two Stage ◽

Selection Of Variables ◽

Level Variable ◽

Stable Variable ◽

Selection Of

Background: High-dimensional genomic data often exhibit strong correlations, which result in unstable and inconsistent estimates from commonly used regularization approaches, including the Lasso, MCP, and related methods. Result: In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures, and propose a two-stage procedure named rPGBS to address the issue of stable variable selection in various strong correlation settings. This approach involves repeatedly running a two-stage hierarchical procedure consisting of random pseudo-group clustering and bi-level variable selection. Conclusion: Both the simulation studies and the high-dimensional genomic data analysis demonstrate the advantage of the proposed rPGBS method over the most commonly used regularization methods. In particular, rPGBS yields more stable selection of variables across a variety of correlation settings than recent work addressing variable selection with strong correlations. Moreover, rPGBS is computationally efficient across various settings.
