Stochastic variational variable selection for high-dimensional microbiome data

2021 ◽  
Author(s):  
Tung Dang ◽  
Kie Kumaishi ◽  
Erika Usui ◽  
Shungo Kobori ◽  
Takumi Sato ◽  
...  

Abstract
Background: The rapid and accurate identification of a minimal-size core set of representative microbial species plays an important role in the clustering of microbial community data and the interpretation of clustering results. However, the huge dimensionality of microbial metagenomics data sets is a major challenge for existing methods such as Dirichlet multinomial mixture (DMM) models. Within the framework of the existing methods, the computational burden of identifying a small number of representative species from a huge number of observed species remains a challenge.
Results: We proposed a novel framework to improve the performance of the widely used DMM approach by combining three ideas: (i) we extended the finite DMM model to the infinite case by considering Dirichlet process mixtures and estimating the number of clusters as a random variable; (ii) we proposed an indicator variable to identify representative operational taxonomic units that substantially contribute to the differentiation among clusters; and (iii) to address the computational burden of high-dimensional microbiome data, we proposed a stochastic variational inference approach, which approximates the posterior distribution using a controllable distribution called the variational distribution, together with stochastic optimization algorithms for fast computation. With the proposed method, named stochastic variational variable selection (SVVS), we analyzed the root microbiome data collected in our soybean field experiment and the human gut microbiome data from three published data sets of large-scale case-control studies.
Conclusions: SVVS demonstrated better performance and significantly faster computation than existing methods on all test data sets. In particular, SVVS is the only method that can analyze massive high-dimensional microbial data with more than 50,000 microbial species and 1,000 samples. Furthermore, the microbial species selected as a core set were suggested to play important roles in recent microbiome studies.
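For reference, the finite Dirichlet multinomial mixture that SVVS extends models each sample's taxon counts with a cluster-specific Dirichlet-multinomial distribution. The formulation below is the standard one from the DMM literature; the notation is chosen here for illustration and is not taken from the paper.

```latex
% Standard Dirichlet-multinomial mixture (DMM) likelihood; notation is ours.
% Sample i has taxon counts x_i = (x_{i1},...,x_{iD}) with total N_i = \sum_j x_{ij};
% cluster k has Dirichlet parameters alpha_k and mixture weight pi_k.
\[
p(\mathbf{x}_i \mid z_i = k, \boldsymbol{\alpha}_k)
  = \frac{\Gamma\!\big(\sum_{j=1}^{D}\alpha_{kj}\big)}
         {\Gamma\!\big(N_i + \sum_{j=1}^{D}\alpha_{kj}\big)}
    \prod_{j=1}^{D}
    \frac{\Gamma\!\big(x_{ij} + \alpha_{kj}\big)}{\Gamma\!\big(\alpha_{kj}\big)},
\qquad
p(\mathbf{x}_i) = \sum_{k=1}^{K} \pi_k \, p(\mathbf{x}_i \mid z_i = k, \boldsymbol{\alpha}_k).
\]
```

SVVS replaces the fixed number of clusters K with a Dirichlet process prior and attaches per-taxon indicator variables, so that only the selected taxa carry cluster-specific parameters, with the posterior approximated by stochastic variational inference.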

2019 ◽  
Vol 9 (14) ◽  
pp. 2841 ◽  
Author(s):  
Nan Zhang ◽  
Xueyi Gao ◽  
Tianyou Yu

Attribute reduction is a challenging problem in rough set theory, which has been applied in many research fields, including knowledge representation, machine learning, and artificial intelligence. The main objective of attribute reduction is to obtain a minimal attribute subset that can retain the same classification or discernibility properties as the original information system. Recently, many attribute reduction algorithms, such as positive region preservation, generalized decision preservation, and distribution preservation, have been proposed. The existing attribute reduction algorithms for generalized decision preservation are mainly based on the discernibility matrix and are, thus, computationally very expensive and hard to use in large-scale and high-dimensional data sets. To overcome this problem, we introduce the similarity degree for generalized decision preservation. On this basis, the inner and outer significance measures are proposed. By using heuristic strategies, we develop two quick reduction algorithms for generalized decision preservation. Finally, theoretical and experimental results show that the proposed heuristic reduction algorithms are effective and efficient.
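The abstract does not spell out the inner and outer significance measures themselves, so the sketch below only illustrates the generic heuristic-addition strategy such reduction algorithms follow: greedily add the attribute with the highest significance until the chosen subset preserves the target property. The `significance` and `preserves_property` callables are placeholders standing in for the paper's measures and for generalized decision preservation.

```python
# Generic heuristic attribute-reduction loop (sketch, not the paper's algorithm).
# `significance` is a placeholder for the inner/outer significance measures, and
# `preserves_property` stands in for generalized decision preservation.
from typing import Callable, FrozenSet, List

def heuristic_reduction(
    attributes: List[str],
    significance: Callable[[FrozenSet[str], str], float],
    preserves_property: Callable[[FrozenSet[str]], bool],
) -> FrozenSet[str]:
    reduct: FrozenSet[str] = frozenset()
    remaining = set(attributes)
    # Greedily add the most significant attribute until the property holds.
    while remaining and not preserves_property(reduct):
        best = max(remaining, key=lambda a: significance(reduct, a))
        reduct = reduct | {best}
        remaining.discard(best)
    # Optional backward pass: drop attributes that are no longer needed.
    for a in list(reduct):
        candidate = reduct - {a}
        if preserves_property(candidate):
            reduct = candidate
    return reduct
```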


Author(s):  
Dianxun Shuai ◽  
Xue Fangliang

Data clustering has been widely used in many areas, such as data mining, statistics, and machine learning. A variety of clustering approaches have been proposed so far, but most of them cannot quickly cluster a large-scale, high-dimensional database. This paper is devoted to a novel data clustering approach based on a generalized particle model (GPM). The GPM transforms the data clustering process into a stochastic process over the configuration space of a GPM array. The proposed approach is characterized by self-organizing clustering and offers several advantages: insensitivity to noise, robustness to the quality of the clustered data, suitability for high-dimensional and massive data sets, learning ability, openness, and easier hardware implementation with VLSI systolic technology. Analysis and simulations have shown the effectiveness and good performance of the proposed GPM approach to data clustering.


2013 ◽  
Vol 444-445 ◽  
pp. 604-609
Author(s):  
Guang Hui Fu ◽  
Pan Wang

LASSO is a very useful variable selection method for high-dimensional data, but it possesses neither the oracle property [Fan and Li, 2001] nor the group effect [Zou and Hastie, 2005]. In this paper, we first review four improved LASSO-type methods that satisfy the oracle property and/or the group effect, and then propose two new ones, called WFEN and WFAEN. Performance on both simulated and real data sets shows that WFEN and WFAEN are competitive with other LASSO-type methods.
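For reference, the baseline estimators discussed above can be written as penalized least-squares problems. The WFEN and WFAEN penalties are not specified in the abstract, so only the standard LASSO and elastic net objectives are shown, with notation chosen here for illustration.

```latex
% Standard LASSO and elastic net objectives (notation ours; WFEN/WFAEN not shown).
\[
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1,
\qquad
\hat{\beta}^{\text{enet}}
  = \arg\min_{\beta}\; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
    + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2.
\]
```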


2019 ◽  
Vol 30 (3) ◽  
pp. 697-719 ◽  
Author(s):  
Fan Wang ◽  
Sach Mukherjee ◽  
Sylvia Richardson ◽  
Steven M. Hill

Abstract: Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a “no panacea” view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
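A minimal sketch of this kind of finite-sample comparison, using scikit-learn's cross-validated Lasso, Ridge, and Elastic Net on synthetic sparse data. This illustrates the study design only, not the authors' benchmarking code; SCAD, the Dantzig Selector, and Stability Selection are not available in scikit-learn and are omitted.

```python
# Illustrative comparison of penalized regression methods on synthetic sparse data.
# Not the authors' benchmark: only readily available scikit-learn estimators are used.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y, true_coef = make_regression(
    n_samples=200, n_features=1000, n_informative=10,
    noise=5.0, coef=True, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "lasso": LassoCV(cv=5),
    "ridge": RidgeCV(),
    "elastic_net": ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    # Variable selection: which coefficients are (nearly) nonzero?
    selected = np.flatnonzero(np.abs(model.coef_) > 1e-8)
    true_support = np.flatnonzero(true_coef)
    recovered = len(set(selected) & set(true_support))
    print(f"{name}: test MSE={mse:.1f}, "
          f"selected {len(selected)} features, {recovered}/10 true ones")
```

A full comparison along the lines of the paper would repeat this over many data-generating scenarios (varying sample size, sparsity, signal strength, and correlation) and score prediction, selection, and ranking separately.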


2020 ◽  
Author(s):  
Tung Dang ◽  
Hirohisa Kishino

Abstract: A central focus of microbiome studies is the characterization of differences in microbiome composition across groups of samples. A major challenge is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches for identifying true differences and increases the chance of false discoveries. We have developed a new framework to address these issues by combining (i) a massively parallel forward variable selection procedure that identifies a few significant features, (ii) mapping of the selected species onto a phylogenetic tree, and (iii) prediction of functional profiles by functional gene enrichment analysis from metagenomic 16S rRNA data. We demonstrated the performance of the proposed approach by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The proposed approach improved the accuracy from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. We identified a core set of 96 species that were significantly enriched in CDI and a core set of 75 species that were enriched in CRC. Moreover, although the quality of the data differed between the functional profiles predicted from the 16S rRNA dataset and the functional metagenome profiling, our approach performed well for both databases and detected the main functions that can be used to diagnose and further study the growth stage of the diseases.
Supplementary information: Hirohisa Kishino: [email protected]; Tung Dang: [email protected]
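The core of step (i) is a greedy forward search over features scored by cross-validated classification performance. The sketch below shows a serial version of that idea; the authors' implementation is massively parallel, and the classifier and stopping rule used here are illustrative choices rather than the paper's settings.

```python
# Serial sketch of forward variable selection for a case-control microbiome table.
# The paper's version parallelizes the candidate evaluation; the classifier and
# stopping rule below are illustrative choices, not the authors' settings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X: np.ndarray, y: np.ndarray, max_features: int = 20):
    selected: list[int] = []
    best_score = 0.0
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate feature added to the current set (parallelizable step).
        scores = {
            j: cross_val_score(
                LogisticRegression(max_iter=1000),
                X[:, selected + [j]], y, cv=5,
            ).mean()
            for j in remaining
        }
        j_best, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:      # stop when no candidate improves accuracy
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = score
    return selected, best_score
```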


Methodology ◽  
2020 ◽  
Vol 16 (2) ◽  
pp. 127-146 ◽  
Author(s):  
Seung Hyun Baek ◽  
Alberto Garcia-Diaz ◽  
Yuanshun Dai

Data mining is one of the most effective statistical methodologies for investigating a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize the reliability of results and computational efficiency are required for the analysis of high-dimensional data. Optimization principles can play a significant role in the rationalization and validation of specialized data mining procedures. This paper presents a novel methodology, Multi-Choice Wavelet Thresholding (MCWT), a three-step approach consisting of perception (dimension reduction), decision (feature ranking), and cognition (model selection). In these steps, three concepts, wavelet thresholding, support vector machines for classification, and information complexity, are integrated to evaluate learning models. Three published data sets are used to illustrate the proposed methodology. Additionally, performance comparisons with recent and widely applied methods are shown.
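A minimal sketch of the first two ingredients, wavelet thresholding for dimension reduction followed by an SVM classifier, using PyWavelets and scikit-learn on toy data. The multi-choice thresholding rules and the information-complexity criterion from the paper are not reproduced here; the universal soft threshold below is a common default used only for illustration.

```python
# Sketch: wavelet-threshold each high-dimensional sample, then classify with an SVM.
# The threshold rule below (universal threshold, soft) is a common default, not the
# paper's multi-choice scheme; the information-complexity step is omitted.
import numpy as np
import pywt
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def wavelet_features(x: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Universal threshold estimated from the finest detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(x)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return np.concatenate(denoised)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))             # 100 samples, 512 noisy measurements (toy)
y = rng.integers(0, 2, size=100)            # binary labels (toy)
X_wav = np.array([wavelet_features(row) for row in X])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_wav, y)
```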


2017 ◽  
Author(s):  
Sahir Rai Bhatnagar ◽  
Yi Yang ◽  
Budhachandra Khundrakpam ◽  
Alan C Evans ◽  
Mathieu Blanchette ◽  
...  

Abstract: Predicting a phenotype and understanding which variables improve that prediction are two very challenging and overlapping problems in the analysis of high-dimensional data such as those arising from genomic and brain imaging studies. It is often believed that the number of truly important predictors is small relative to the total number of variables, making computational approaches to variable selection and dimension reduction extremely important. To reduce dimensionality, commonly used two-step methods first cluster the data in some way, and then build models using cluster summaries to predict the phenotype. It is known that important exposure variables can alter correlation patterns between clusters of high-dimensional variables, i.e., alter network properties of the variables. However, it is not well understood whether such altered clustering is informative in prediction. Here, assuming there is a binary exposure with such network-altering effects, we explore whether use of exposure-dependent clustering relationships in dimension reduction can improve predictive modelling in a two-step framework. Hence, we propose a modelling framework called ECLUST to test this hypothesis, and evaluate its performance through extensive simulations. With ECLUST, we found improved prediction and variable selection performance compared to methods that do not consider the environment in the clustering step, or to methods that use the original data as features. We further illustrate this modelling framework through the analysis of three data sets from very different fields, each with high dimensional data, a binary exposure, and a phenotype of interest. Our method is available in the eclust CRAN package.
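A conceptual sketch of the two-step idea: cluster the high-dimensional variables separately within each exposure group, so that exposure-dependent correlation structure can drive the clusters, and then summarize each cluster as a feature for downstream prediction. This is not the eclust package's algorithm; the clustering method and the cluster-mean summary used here are arbitrary choices for illustration.

```python
# Sketch of exposure-dependent clustering for dimension reduction (ECLUST-style idea).
# Hierarchical clustering on 1 - |correlation|, run within each exposure group;
# each cluster is summarized by its mean to form low-dimensional features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def variable_clusters(X: np.ndarray, n_clusters: int = 10) -> np.ndarray:
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

def cluster_means(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    return np.column_stack([X[:, labels == k].mean(axis=1) for k in np.unique(labels)])

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 500))          # toy high-dimensional data
exposure = rng.integers(0, 2, size=120)  # binary exposure

# Derive variable clusters within each exposure group, then summarize all samples
# with both sets of clusters to obtain features for a downstream prediction model.
labels_e0 = variable_clusters(X[exposure == 0])
labels_e1 = variable_clusters(X[exposure == 1])
features = np.hstack([cluster_means(X, labels_e0), cluster_means(X, labels_e1)])
```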


2021 ◽  
Author(s):  
Steven Marc Weisberg ◽  
Victor Roger Schinazi ◽  
Andrea Ferrario ◽  
Nora Newcombe

Relying on shared tasks and stimuli to conduct research can enhance the replicability of findings and allow a community of researchers to collect large data sets across multiple experiments. This approach is particularly relevant for experiments in spatial navigation, which often require the development of unfamiliar large-scale virtual environments to test participants. One challenge with shared platforms is that undetected technical errors, rather than being restricted to individual studies, become pervasive across many studies. Here, we discuss the discovery of a programming error (a bug) in a virtual environment platform used to investigate individual differences in spatial navigation: Virtual Silcton. The bug resulted in storing the absolute value of an angle in a pointing task rather than the signed angle. This bug was difficult to detect for several reasons, and it rendered the original sign of the angle unrecoverable. To assess the impact of the error on published findings, we collected a new data set for comparison. Our results revealed that the effect of the error on published data is likely to be minimal, partially explaining the difficulty in detecting the bug over the years. We also used the new data set to develop a tool that allows researchers who have previously used Virtual Silcton to evaluate the impact of the bug on their findings. We summarize the ways that shared open materials, shared data, and collaboration can pave the way for better science to prevent errors in the future.
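To make the nature of the bug concrete, the toy Python snippet below contrasts a signed pointing error with its absolute value. This is a hypothetical reconstruction of the kind of error described, not Virtual Silcton's actual code.

```python
# Illustration of the kind of bug described: a signed pointing error collapsed
# to its absolute value at storage time, after which the sign cannot be recovered.
# (Hypothetical reconstruction; Virtual Silcton's actual code is not shown here.)

def signed_pointing_error(pointed_deg: float, correct_deg: float) -> float:
    """Signed angular difference in (-180, 180]; negative = counterclockwise miss."""
    diff = (pointed_deg - correct_deg + 180.0) % 360.0 - 180.0
    return 180.0 if diff == -180.0 else diff

pointed, correct = 350.0, 20.0
signed = signed_pointing_error(pointed, correct)   # -30.0
stored = abs(signed)                               # 30.0  <- what the bug recorded
# Once only `stored` is saved, -30 and +30 are indistinguishable.
```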

