Computationally scalable regression modeling for ultrahigh-dimensional omics data with ParProx

2021 ◽  
Author(s):  
Seyoon Ko ◽  
Ginny X. Li ◽  
Hyungwon Choi ◽  
Joong-Ho Won

Abstract Statistical analysis of ultrahigh-dimensional omics-scale data has long depended on univariate hypothesis testing. With growing numbers of data features and samples, the obvious next step is to establish multivariable association analysis as a routine method for understanding genotype-phenotype associations. Here we present ParProx, a state-of-the-art implementation to optimize overlapping group lasso regression models for time-to-event and classification analysis, guided by biological priors through coordinated variable selection. ParProx not only enables model fitting for ultrahigh-dimensional data within the architecture for parallel or distributed computing, but also allows users to obtain interpretable regression models consistent with known biological relationships among the independent variables, a feature long neglected in statistical modeling of omics data. We demonstrate ParProx using three different omics data sets with moderate to large numbers of variables, where we use genomic regions and pathways to arrive at sparse regression models composed of biologically related independent variables. ParProx is naturally applicable to a wide range of studies using ultrahigh-dimensional omics data, ranging from genome-wide association analysis to single-cell sequencing studies where multivariable modeling is computationally intractable.
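As an illustration of the group lasso penalty named in this abstract: proximal-gradient solvers commonly handle it with a block soft-thresholding step. The Python sketch below shows that step for non-overlapping groups only. It assumes that overlapping groups would first be made disjoint, for example by duplicating shared variables, which is one common reformulation. It is not taken from ParProx, and the names `group_soft_threshold`, `groups`, and `lam` are hypothetical.

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of lam * sum_g ||beta_g||_2 for disjoint groups.

    beta   : 1-D array of coefficients
    groups : list of index arrays, one per non-overlapping group (assumed)
    lam    : non-negative threshold (step size times penalty weight)
    """
    out = np.zeros_like(beta, dtype=float)
    for idx in groups:
        block = beta[idx]
        norm = np.linalg.norm(block)
        # Shrink the whole block toward zero. Blocks whose norm is at most
        # lam are set to zero, so variables are selected group by group.
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * block
    return out

# Hypothetical coefficients: one strong group and one weak group.
beta = np.array([3.0, 4.0, 0.3, -0.4])
groups = [np.array([0, 1]), np.array([2, 3])]
shrunk = group_soft_threshold(beta, groups, lam=1.0)
```

In this toy input the first group (norm 5) is scaled by 0.8 and the second (norm 0.5) is removed entirely, which is the group-level sparsity the abstract refers to as coordinated variable selection.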

2019 ◽  
Author(s):  
Soumita Ghosh ◽  
Abhik Datta ◽  
Hyungwon Choi

Abstract Emerging multi-omics experiments pose new challenges for exploration of quantitative data sets. We present multiSLIDE, a web-based interactive tool for simultaneous heatmap visualization of interconnected molecular features in multi-omics data sets. multiSLIDE operates by keyword search for visualizing biologically connected molecular features, such as genes in pathways and Gene Ontologies, offering convenient functionalities to rearrange, filter, and cluster data sets in a web browser in real time. Various built-in querying mechanisms make it adaptable to diverse omics types, and visualizations are fully customizable. We demonstrate the versatility of the tool through three example studies, each of which showcases its applicability to a wide range of multi-omics data sets, its ability to visualize the links between molecules at different granularities of measurement units, and the interface to incorporate inter-molecular relationships from external data sources into the visualization. Online and standalone versions of multiSLIDE are available at https://github.com/soumitag/multiSLIDE.


1990 ◽  
Vol 47 (6) ◽  
pp. 1148-1156 ◽  
Author(s):  
Laura J. Richards ◽  
Jon T. Schnute

In this paper we describe a general method for determining the relationship between fecundity and another fish attribute, such as size or age. Our methods include linear and logarithmic regression models as special cases and are applicable to a wide range of situations. The model we propose is based on the univariate form of the Schnute–Jensen dose–response model. However, we extend the Schnute–Jensen analysis by describing exact inference regions obtained from likelihood contours, to which we assign nominal probability levels. We also provide a method for obtaining an inference band for the predicted curve. We examine the issue of model adequacy as it relates to fecundity–length data from two rockfish (Sebastes) species. We show that the extra complexity of our model is justified, as none of the traditional regression models are appropriate for all three of our data sets. Further, we use inference bands to distinguish fecundity–length relationships for quillback rockfish (S. maliger) from two areas, but we are unable to distinguish one of these relationships from a similar relationship for copper rockfish (S. caurinus).


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Soumita Ghosh ◽  
Abhik Datta ◽  
Hyungwon Choi

Abstract Quantitative multi-omics data are difficult to interpret and visualize due to the large volume of data, complexity among data features, and heterogeneity of information represented by different omics platforms. Here, we present multiSLIDE, a web-based interactive tool for the simultaneous visualization of interconnected molecular features in heatmaps of multi-omics data sets. multiSLIDE visualizes biologically connected molecular features by keyword search of pathways or genes, offering convenient functionalities to query, rearrange, filter, and cluster data in a web browser in real time. Various querying mechanisms make it adaptable to diverse omics types, and visualizations are customizable. We demonstrate the versatility of multiSLIDE through three examples, showcasing its applicability to a wide range of multi-omics data sets, by allowing users to visualize established links between molecules from different omics data, as well as incorporate custom inter-molecular relationship information into the visualization. Online and stand-alone versions of multiSLIDE are available at https://github.com/soumitag/multiSLIDE.


2017 ◽  
Author(s):  
Florian Rohart ◽  
Benoît Gautier ◽  
Amrit Singh ◽  
Kim-Anh Lê Cao

Abstract The advent of high-throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, and metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have focused on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of ‘omics data available from the package.


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Hui Zhang ◽  
Minghui Ao ◽  
Arianna Boja ◽  
Michael Schnaubelt ◽  
Yingwei Hu

Abstract Background The rapid advancements of high-throughput “omics” technologies have brought a massive amount of data to process during and after experiments. Multi-omic analysis facilitates a deeper interrogation of a dataset and the discovery of interesting genes, proteins, lipids, glycans, metabolites, or pathways related to the corresponding phenotypes in a study. Many individual software tools have been developed for data analysis and visualization. However, an efficient way to investigate phenotypes with multiple omics data is still lacking. Here, we present OmicsOne as an interactive web-based framework for rapid phenotype association analysis of multi-omic data by integrating quality control, statistical analysis, and interactive data visualization with ‘one click’. Materials and methods OmicsOne was applied to the previously published proteomic and glycoproteomic data sets of high-grade serous ovarian carcinoma (HGSOC) and the published proteome data set of lung squamous cell carcinoma (LSCC) to confirm its performance. The data were analyzed through six main functional modules implemented in OmicsOne: (1) phenotype profiling, (2) data preprocessing and quality control, (3) knowledge annotation, (4) phenotype-associated feature discovery, (5) correlation and regression model analysis for phenotype association analysis on individual features, and (6) enrichment analysis for phenotype association analysis on feature sets of interest. Results We developed an integrated software solution, OmicsOne, for phenotype association analysis on multi-omics data sets. The application of OmicsOne to the public ovarian cancer data set showed that the software could consistently confirm the previous observations and discover new evidence for HNRNPU and a glycopeptide of HYOU1 as potential biomarkers for HGSOC data sets. The performance of OmicsOne was further demonstrated in the Tumor and NAT comparison study on the proteome data set of LSCC.
Conclusions OmicsOne can effectively simplify data analysis and reveal the significant associations between phenotypes and potential biomarkers, including genes, proteins, and glycopeptides, in minutes to assist users to understand aberrant biological processes.


Dose-Response ◽  
2021 ◽  
Vol 19 (4) ◽  
pp. 155932582110627
Author(s):  
Linqian Yang ◽  
Jiaying Wang ◽  
Robert A. Cheke ◽  
Sanyi Tang

Purpose Dose-response curves, which fit a multitude of experimental data derived from toxicology, are widely used in physics, chemistry, biology, and other fields. Although there are many dose-response models for fitting dose-response curves, the application of these models is limited by many restrictions and lacks universality, so there is a need for a novel, universal dynamical model that can improve fits to various types of dose-response curves. Methods We expand the hormetic Ricker model, taking the delay inherent in the dose-response into account, and develop a novel and dynamic delayed Ricker difference model (DRDM) to fit various types of dose-response curves. Furthermore, we compare the DRDM with other dose-response models to confirm that it can mimic different types of dose-response curves. Data analysis By fitting various types of dose-response data sets derived from drug applications, disease treatment, pest control, and plant management, and comparing the imitative effect of the DRDM with other models, we find that the DRDM fits monotonic dose-response data well and, in most circumstances, the DRDM has a better imitative effect on non-monotonic dose-response data with hormesis than other models do. Results The MSE of fits of the DRDM to S-shaped dose-response data (DS2-G) is not lower than those for four other models, but the MSEs of fits to U-shaped (DS7) and inverted U-shaped dose-response data (DS10) are lower than those for two other models. This means that the imitative effect of the DRDM is comparable to other models of monotonic dose-response data, but is a significant improvement compared to traditional models of non-monotonic dose-response data with hormesis. Conclusion We propose a novel dynamic model (DRDM) for fitting various types of dose-response curves, which can reflect the dynamic trend of the population growth compared with traditional static dose-response models.
By analyzing data, we have confirmed that the DRDM provides an ideal description of various dose-response observations and it can be used to fit a wide range of dose-response data sets, especially for hormetic data sets. Therefore, we conclude that the DRDM has a good universality for dose-response curve fitting.


2014 ◽  
Vol 81 (5) ◽  
pp. 1573-1584 ◽  
Author(s):  
Mohamed Mysara ◽  
Yvan Saeys ◽  
Natalie Leys ◽  
Jeroen Raes ◽  
Pieter Monsieurs

ABSTRACT In ecological studies, microbial diversity is nowadays mostly assessed via the detection of phylogenetic marker genes, such as 16S rRNA. However, PCR amplification of these marker genes produces a significant amount of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. Therefore, two machine learning classifiers (reference-based and de novo CATCh) were developed by integrating the output of existing chimera detection tools into a new, more powerful method. When comparing our classifiers with existing tools in either the reference-based or de novo mode, a higher performance of our ensemble method was observed on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Since our algorithm combines the advantages of different individual chimera detection tools, our approach produces more robust results when challenged with chimeric sequences having a low parent divergence, short length of the chimeric range, and various numbers of parents. Additionally, it could be shown that integrating CATCh in the preprocessing pipeline has a beneficial effect on the quality of the clustering in operational taxonomic units.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Eleanor F. Miller ◽  
Andrea Manica

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.


1980 ◽  
Vol 60 (2) ◽  
pp. 223-230 ◽  
Author(s):  
S. D. M. JONES ◽  
R. J. RICHMOND ◽  
M. A. PRICE ◽  
R. B. BERG

The growth and distribution of fat from 163 pig carcasses were compared among five breeds (Duroc × Yorkshire (D × Y), Hampshire × Yorkshire (H × Y), Yorkshire (Y × Y), Yorkshire × Lacombe-Yorkshire (Y × L-Y) and Lacombe × Yorkshire (L × Y)) and two sex-types (barrows and gilts) over a wide range in carcass weight. The growth pattern of fat and the fat depots were estimated from the allometric equation (Y = aX^b) using side muscle weight and side fat weight separately as independent variables. Growth coefficients (b) for intermuscular and subcutaneous fat depots were similar for the hindquarter but the intermuscular depot coefficient was slightly higher for the forequarter. The coefficient for body cavity fat was highest in all comparisons. No significant differences were detected for coefficients among breeds and between sexes using both total muscle and total side fat as independent variables. Significant breed and sex-type differences were found in the fat depots at a constant weight of side muscle. This would indicate that breed differences in fatness seemed to be more influenced by the initiation of fattening at different muscle weights than by any inherent differences in rate of fattening. Significant breed differences were also found in the fat depots at a constant fat weight, indicating that breed may influence fat distribution. Sex-type had no effect on fat distribution when the evaluation was made at constant fatness.
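The abstract does not say how the allometric equation was fitted. One common approach, sketched below in Python, is to take logarithms so that log Y = log a + b log X becomes a straight line, then estimate the growth coefficient b as its slope. The function name `fit_allometric` and the weights used are hypothetical and are not data from the study.

```python
import numpy as np

def fit_allometric(x, y):
    """Estimate a and b in Y = a * X**b by least squares on the log-log scale."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # log Y = log a + b * log X, so a straight-line fit recovers both parameters.
    # np.polyfit returns the highest-degree coefficient first: (slope, intercept).
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(log_a), b

# Hypothetical, noise-free weights generated with a = 0.5 and b = 1.3.
muscle = np.array([5.0, 10.0, 20.0, 40.0])
fat = 0.5 * muscle ** 1.3
a_hat, b_hat = fit_allometric(muscle, fat)
```

A coefficient b above 1 means the depot grows faster than the reference tissue, and b below 1 means it grows more slowly, which is how growth coefficients of this kind are usually compared across depots.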


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. Then the pairwise intermediates are integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt the robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data on the cell cycle. MUREN is implemented as an R package. The code under license GPL-3 is available on the GitHub platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs the RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
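To make the least trimmed squares idea in this abstract concrete, here is a minimal Python sketch of a simplified fit for a straight line. It repeatedly refits ordinary least squares on the fraction of points with the smallest squared residuals. This is not MUREN's implementation. The single start from all points is a simplifying assumption (practical solvers typically use many starting subsets), and `lts_line`, `keep_frac`, and the data are hypothetical.

```python
import numpy as np

def lts_line(x, y, keep_frac=0.75, n_iter=20):
    """Approximate least trimmed squares fit of y = intercept + slope * x.

    Each pass refits ordinary least squares on the keep_frac share of
    points that currently have the smallest squared residuals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    h = int(np.ceil(keep_frac * len(x)))   # number of points kept per pass
    subset = np.arange(len(x))             # simplification: start from all points
    for _ in range(n_iter):
        slope, intercept = np.polyfit(x[subset], y[subset], deg=1)
        resid2 = (y - (intercept + slope * x)) ** 2
        new_subset = np.sort(np.argsort(resid2)[:h])
        if np.array_equal(new_subset, subset):
            break                          # kept set is stable, so stop
        subset = new_subset
    return intercept, slope

# Hypothetical data: an exact line with two gross outliers in the middle.
x = np.arange(10, dtype=float)
y = 2.0 + 0.5 * x
y[[4, 5]] += 10.0
intercept, slope = lts_line(x, y)
```

Because the two shifted points always have the largest residuals, they are trimmed after the first pass and the fit returns to the underlying line. In the normalization setting, the analogous role is played by strongly differentiated genes, which should not drive the fit.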

