Meta-Analyzing Multiple Omics Data With Robust Variable Selection

2021 ◽  
Vol 12 ◽  
Author(s):  
Zongliang Hu ◽  
Yan Zhou ◽  
Tiejun Tong

High-throughput omics data are becoming more and more popular in various areas of science. Given that many publicly available datasets address the same questions, researchers have applied meta-analysis to synthesize multiple datasets and achieve more reliable results for model estimation and prediction. Due to the high dimensionality of omics data, it is also desirable to incorporate variable selection into the meta-analysis. Existing variable selection methods for meta-analysis are often sensitive to the presence of outliers and may miss relevant covariates, especially with lasso-type penalties. In this paper, we develop a robust variable selection algorithm for meta-analyzing high-dimensional datasets based on logistic regression. We first search for an outlier-free subset of each dataset by borrowing information across the datasets, repeatedly applying the least trimmed squares estimator for the logistic model together with a hierarchical bi-level variable selection technique. Once a reliable non-outlier subset is obtained, we then apply a reweighting step to further improve efficiency. Simulation studies and real data analysis show that our new method provides more reliable results than existing meta-analysis methods in the presence of outliers.
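The trimming idea in this abstract can be illustrated in isolation. The sketch below is our own simplified, single-dataset analogue (function name and tuning constants are illustrative, not the authors' code): it repeatedly fits a logistic model by gradient ascent, ranks all observations by deviance, and refits on the least-deviant subset, in the spirit of least trimmed squares.

```python
import numpy as np

def trimmed_logistic_fit(X, y, keep_frac=0.8, n_outer=15, n_gd=400, lr=0.5):
    # Start from the full sample; alternately fit on the current subset and
    # keep the h observations that the fitted model explains best.
    n, p = X.shape
    h = int(keep_frac * n)
    idx = np.arange(n)
    beta = np.zeros(p)
    for _ in range(n_outer):
        # Fit logistic regression on the current subset by gradient ascent.
        beta = np.zeros(p)
        Xs, ys = X[idx], y[idx]
        for _ in range(n_gd):
            pr = 1.0 / (1.0 + np.exp(-Xs @ beta))
            beta += lr * Xs.T @ (ys - pr) / len(ys)
        # Rank ALL observations by deviance and keep the h smallest.
        pr_all = 1.0 / (1.0 + np.exp(-X @ beta))
        dev = -(y * np.log(pr_all + 1e-12) + (1 - y) * np.log(1 - pr_all + 1e-12))
        idx = np.argsort(dev)[:h]
    return beta, idx
```

Observations with flipped labels receive large deviances under a good fit and are pushed out of the subset, which is why the trimmed estimate stays close to the clean-data coefficients.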

BioTech ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 3
Author(s):  
Yinhao Du ◽  
Kun Fan ◽  
Xi Lu ◽  
Cen Wu

Gene-environment (G×E) interactions are critical for understanding the genetic basis of complex disease beyond genetic and environmental main effects. In addition to existing tools for interaction studies, penalized variable selection has emerged as a promising alternative for dissecting G×E interactions. Despite this success, existing variable selection methods are limited in accounting for multidimensional measurements: published methods cannot accommodate structured sparsity in the framework of integrating multi-omics data for disease outcomes. In this paper, we develop a novel variable selection method to integrate multi-omics measurements in G×E interaction studies. Extensive studies have already revealed that analyzing omics data across multiple platforms is not only biologically sensible but also results in improved identification and prediction performance. Our integrative model can efficiently pinpoint important regulators of gene expression through sparse dimensionality reduction, and links the disease outcomes to multiple effects in the integrative G×E study by accommodating a sparse bi-level structure. Simulation studies show that the integrative model leads to better identification of G×E interactions and regulators than alternative methods. In two G×E lung cancer studies with high-dimensional multi-omics data, the integrative model yields improved prediction and findings with important biological implications.
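As background for what an integrative G×E model operates on, the snippet below (our own illustration, not the authors' code) builds the standard design matrix of genetic main effects, environmental main effects, and all pairwise G×E interaction products, i.e. the input on which bi-level (group and within-group) sparsity penalties act:

```python
import numpy as np

def build_gxe_design(G, E):
    # G: (n, p) genetic measurements; E: (n, q) environmental exposures.
    # Columns: p genetic main effects, q environmental main effects,
    # then p*q interaction products grouped gene-by-gene (the "bi-level" groups).
    n, p = G.shape
    q = E.shape[1]
    inter = (G[:, :, None] * E[:, None, :]).reshape(n, p * q)
    return np.hstack([G, E, inter])
```

A bi-level penalty then acts on each gene's block of q interaction columns as a group, while also allowing within-group sparsity.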


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jing Tian ◽  
Jianping Zhao ◽  
Chunhou Zheng

Abstract Background In recent years, various sequencing techniques have been used to collect biomedical omics datasets, and it is often possible to obtain multiple types of omics data from a single patient sample. Clustering of omics data plays an indispensable role in biological and medical research, helping to reveal data structures across multiple collections. Nevertheless, clustering omics data poses many challenges, primarily the high dimensionality of the data and the small sample size, which make it difficult to find a suitable integration method for structural analysis of multiple datasets. Results In this paper, a multi-view clustering method based on the Stiefel manifold (MCSM) is proposed. The MCSM method comprises three core steps. First, we establish a binary optimization model for the simultaneous clustering problem. Second, we solve the optimization problem by a linear search algorithm on the Stiefel manifold. Finally, we integrate the clustering results obtained from the three omics types using the k-nearest neighbor method. We applied this approach to four cancer datasets from TCGA. The results show that our method is superior to several state-of-the-art methods, which depend on the hypothesis that the underlying omics cluster classes are the same. Conclusion In particular, our approach outperforms the compared approaches when the underlying clusters are inconsistent. For patients with different subtypes, both consistent and differential clusters can be identified at the same time.
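The final integration step combines per-omics cluster assignments. As a generic illustration of that kind of integration (a co-association consensus, not necessarily the exact k-nearest-neighbor rule used by MCSM), one can count how often each pair of samples is co-clustered across views:

```python
import numpy as np

def co_association(label_sets):
    # label_sets: list of 1-D cluster-label arrays, one per omics view.
    # Entry (i, j) is the fraction of views in which samples i and j
    # are assigned to the same cluster.
    labels = [np.asarray(lab) for lab in label_sets]
    n = len(labels[0])
    C = np.zeros((n, n))
    for lab in labels:
        C += lab[:, None] == lab[None, :]
    return C / len(labels)
```

Pairs with consensus near 1 are "consistent" across omics, while intermediate values flag the differential cluster structure the abstract refers to.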


EBioMedicine ◽  
2021 ◽  
Vol 70 ◽  
pp. 103525
Author(s):  
Abhijith Biji ◽  
Oyahida Khatun ◽  
Shachee Swaraj ◽  
Rohan Narayan ◽  
Raju S. Rajmani ◽  
...  

2021 ◽  
Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

Environmental data often present missing values or a lack of information that makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for an Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020 and present two issues:

- In the Guadarranque River, the dam level variable has no data from May to August 2020, probably because of sensor damage.
- No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.

In order to avoid removing the dam variable from the entire model (or those missing months), or even rejecting the modelling of one river system, this abstract provides modelling solutions based on Bayesian networks (BNs) that overcome these limitations.

Guadarranque River. Missing values.

The dataset contains 75687 observations of 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt using the complete dataset (until September 2019) with the aim of predicting the dam level variable as accurately as possible. A scenario was carried out with data from October 2019 to March 2020, comparing the prediction made for the target variable with the real data. Results show that both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.

In addition, a BN with an expert-elicited structure was learnt from the real data and from both datasets with values imputed by NB and TAN. Results show that the models learnt from imputed data (NB: 3.33; TAN: 3.07) improve on the error of the model learnt from the real data (4.26).

Guadiaro River. Lack of information.

The dataset contains 73636 observations of 14 continuous variables. Since the rainfall variables present a high percentage of zero values (over 94%), they were discretised by the Equal Frequency method with 4 intervals. The aim is to predict flooding risk in the coastal area, but no data are collected from this area. Thus, an unsupervised classification based on hybrid BNs was performed, in which a target variable classifies all observations into a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Results show a total of 3 groups:

- Group 0, "Normal situation": rainfall values equal to 0 and a very low mean river level.
- Group 1, "Storm situation": mean rainfall values over 0.3 mm, and all river level variables double their means with respect to Group 0.
- Group 2, "Extreme situation": both rainfall and river level mean values are the highest, far from both previous groups.

Even though validation shows this methodology is able to identify extreme events, further work is needed. To this end, data from the autumn-winter season (October 2020 to March 2021) will be used. With this new information it will be possible to check whether the latest extreme events (the flooding event during December and the Filomena storm during January) are identified.
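The Equal Frequency discretisation used for the rainfall variables can be sketched as follows (a generic quantile-based implementation; the abstract does not specify the exact software used):

```python
import numpy as np

def equal_frequency_discretise(x, k=4):
    # Split x into k intervals containing (approximately) equal numbers
    # of points, using the interior quantiles as cut points.
    cuts = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(x, cuts)  # interval labels 0 .. k-1
```

Note that with a variable that is over 94% zeros, several quantile cuts coincide at zero, so in practice ties need extra handling (e.g. separating the zero mass into its own interval first).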


PLoS ONE ◽  
2018 ◽  
Vol 13 (6) ◽  
pp. e0197910 ◽  
Author(s):  
Alexander Kirpich ◽  
Elizabeth A. Ainsworth ◽  
Jessica M. Wedow ◽  
Jeremy R. B. Newman ◽  
George Michailidis ◽  
...  

2018 ◽  
Author(s):  
CR Tench ◽  
Radu Tanasescu ◽  
CS Constantinescu ◽  
DP Auer ◽  
WJ Cottam

Abstract Meta-analysis of published neuroimaging results is commonly performed using coordinate based meta-analysis (CBMA). Most CBMA algorithms detect spatial clustering of reported coordinates across multiple studies by assuming that results relating to the common hypothesis fall in similar anatomical locations; the null hypothesis, that studies report uncorrelated results, is simulated by random coordinates. The multiple clusters are assumed independent, yet the multiple results reported per study likely are not, and in fact represent a network effect. Here the multiple reported effect sizes (reported peak Z scores) are assumed multivariate normal, and maximum likelihood is used to estimate the parameters of the covariance matrix. The hypothesis is that the effect sizes are correlated. The parameters are the covariances of effect size, considered as edges of a network, while clusters are considered as nodes. In this way, coordinate based meta-analysis of networks (CBMAN) estimates a network of reported meta-effects, rather than multiple independent effects (clusters). CBMAN uses only the same data as CBMA, yet produces extra information in terms of the correlation between clusters. Here it is validated on numerically simulated data and demonstrated on real data used previously to demonstrate CBMA. The CBMA and CBMAN clusters are similar, despite the very different hypotheses.
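Under the stated multivariate-normal model, and assuming complete data (every study reports every cluster), the maximum-likelihood estimate of the covariance of the per-study effect sizes reduces to the biased sample covariance. A minimal sketch of turning that into a cluster network follows; the threshold and function name are ours, and the real method must also cope with studies that do not report every cluster:

```python
import numpy as np

def effect_size_network(Z, edge_thresh=0.3):
    # Z: (n_studies, n_clusters) matrix of reported peak Z scores.
    S = np.cov(Z, rowvar=False, bias=True)   # MLE of the covariance under MVN
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                   # correlations = candidate edge weights
    # Clusters are nodes; an edge is declared where |correlation| is large.
    edges = (np.abs(R) >= edge_thresh) & ~np.eye(len(d), dtype=bool)
    return R, edges
```

This mirrors the nodes-and-edges interpretation in the abstract: clusters are nodes, and the estimated covariance (correlation) of their effect sizes defines the edges.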


2020 ◽  
Vol 17 (11) ◽  
pp. 4813-4818
Author(s):  
Sanaa Al-Marzouki ◽  
Sharifah Alrajhi

We propose a new family of distributions derived from the half logistic model, called the generalized odd half logistic family. We express its density function as a linear combination of exponentiated densities and derive some statistical properties, including the moments, probability weighted moments, quantile function, and order statistics. Two new special models are presented. We study the estimation of the parameters of the odd generalized half logistic exponential and the odd generalized half logistic Rayleigh models using the maximum likelihood method. One real data set is assessed to illustrate the usefulness of the proposed family.
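For orientation, one common way such half-logistic families are generated in this literature (an illustrative construction; the paper's exact parameterization may differ) is to compose the standard half-logistic CDF with a log-odd-type transform of a baseline CDF $G$:

```latex
% Standard half-logistic CDF on (0, \infty):
H(t) = \frac{1 - e^{-t}}{1 + e^{-t}}.

% Composing H with t = -\log\,[1 - G(x)]^{\alpha} for a baseline CDF G
% gives an odd half-logistic-type family with shape parameter \alpha > 0:
F(x) = H\!\left(-\log[1 - G(x)]^{\alpha}\right)
     = \frac{1 - [1 - G(x)]^{\alpha}}{1 + [1 - G(x)]^{\alpha}}.
```

Taking $G$ to be the exponential or Rayleigh CDF would then yield special models of the kind mentioned in the abstract.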


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Andreas Heinecke ◽  
Marta Tallarita ◽  
Maria De Iorio

Abstract Background Network meta-analysis (NMA) provides a powerful tool for the simultaneous evaluation of multiple treatments by combining evidence from different studies, allowing for direct and indirect comparisons between treatments. In recent years, NMA has become increasingly popular in the medical literature, and the underlying statistical methodologies are evolving in both the frequentist and Bayesian frameworks. Traditional NMA models are often based on the comparison of two treatment arms per study, and these individual studies may measure outcomes at multiple time points that are not necessarily homogeneous across studies. Methods In this article we present a Bayesian model based on B-splines for the simultaneous analysis of outcomes across time points that allows for the indirect comparison of treatments across different longitudinal studies. Results We illustrate the proposed approach in simulations as well as on real data examples available in the literature, and compare it with a model based on P-splines and one based on fractional polynomials, showing that our approach is flexible and overcomes the limitations of the latter. Conclusions The proposed approach is computationally efficient and able to accommodate a large class of temporal treatment effect patterns, allowing for direct and indirect comparisons of widely varying shapes of longitudinal profiles.
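The B-spline building block behind such longitudinal models is standard. A self-contained Cox-de Boor evaluation (our own illustration, independent of the authors' implementation) that returns the spline design matrix for a set of time points is:

```python
import numpy as np

def bspline_basis(x, knots, degree):
    # Cox-de Boor recursion: returns the (len(x), n_basis) design matrix,
    # where n_basis = len(knots) - degree - 1.
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # Degree-0 basis: indicator functions of the knot intervals.
    B = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        Bn = np.zeros((len(x), len(t) - d - 1))
        for i in range(len(t) - d - 1):
            left = 0.0 if t[i + d] == t[i] else \
                (x - t[i]) / (t[i + d] - t[i]) * B[:, i]
            right = 0.0 if t[i + d + 1] == t[i + 1] else \
                (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[:, i + 1]
            Bn[:, i] = left + right
        B = Bn
    return B
```

A longitudinal treatment-effect curve is then modelled as this design matrix times a vector of (study- or treatment-specific) spline coefficients, which is what the Bayesian model places priors on.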


Mathematics ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 109
Author(s):  
Francisco J. Ariza-Hernandez ◽  
Martin P. Arciga-Alejandre ◽  
Jorge Sanchez-Ortiz ◽  
Alberto Fleitas-Imbert

In this paper, we consider the inverse problem of derivative order estimation in a fractional logistic model. To solve the direct problem, we use the Grünwald-Letnikov fractional derivative; the inverse problem is then tackled from a Bayesian perspective. To construct the likelihood function, we propose an explicit numerical scheme based on the truncated series of the derivative definition. From MCMC samples of the marginal posterior distributions, we estimate the order of the derivative and the growth rate parameter in the dynamic model, as well as the noise in the observations. To evaluate the methodology, a simulation study was performed with synthetic data, in which the bias and mean squared error were calculated; the results give evidence of the effectiveness of the method and the suitable performance of the proposed model. Moreover, an example with real data is presented as evidence of the relevance of using a fractional model.
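The truncated Grünwald-Letnikov scheme underlying such a likelihood has a compact numerical form. A generic sketch (our own, using the standard recursive weights; not the authors' code) is:

```python
import numpy as np

def gl_derivative(f, alpha, h):
    # Truncated Grunwald-Letnikov derivative of order alpha on a uniform grid:
    #   D^alpha f(t_j) ~ h**(-alpha) * sum_{k=0..j} w_k * f(t_{j-k}),
    # with weights w_k = (-1)**k * C(alpha, k) from the standard recursion
    #   w_0 = 1,  w_k = w_{k-1} * (1 - (alpha + 1) / k).
    f = np.asarray(f, dtype=float)
    n = len(f)
    w = np.ones(n)
    for k in range(1, n):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return np.array([np.dot(w[:j + 1], f[j::-1]) for j in range(n)]) / h**alpha
```

For alpha = 1 the weights reduce to (1, -1, 0, 0, ...), so the scheme collapses to ordinary backward differences; for non-integer alpha the whole history of f enters each point, which is the memory effect the fractional logistic model exploits.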

