Computationally scalable regression modeling for ultrahigh-dimensional omics data with ParProx

Abstract: Statistical analysis of ultrahigh-dimensional omics-scale data has long depended on univariate hypothesis testing. With growing numbers of data features and samples, the obvious next step is to establish multivariable association analysis as a routine method for understanding genotype-phenotype associations. Here we present ParProx, a state-of-the-art implementation to optimize overlapping group lasso regression models for time-to-event and classification analysis, guided by biological priors through coordinated variable selection. ParProx not only enables model fitting for ultrahigh-dimensional data within the architecture for parallel or distributed computing, but also allows users to obtain interpretable regression models consistent with known biological relationships among the independent variables, a feature long neglected in statistical modeling of omics data. We demonstrate ParProx using three different omics data sets of moderate to large numbers of variables, where we use genomic regions and pathways to arrive at sparse regression models composed of biologically related independent variables. ParProx is naturally applicable to a wide range of studies using ultrahigh-dimensional omics data, ranging from genome-wide association analysis to single-cell sequencing studies where multivariable modeling is computationally intractable.
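As a rough illustration of the kind of update a proximal method for group lasso relies on, the sketch below implements block soft-thresholding for the simpler case of non-overlapping groups. It is not ParProx code: the function name `prox_group_lasso`, its arguments, and the toy numbers are all hypothetical, and the abstract does not say how ParProx treats overlapping groups (one common approach is to duplicate shared variables so that groups become disjoint).

```python
import math


def prox_group_lasso(v, groups, lam):
    """Block soft-thresholding for a group lasso penalty.

    v      -- list of coefficients after a gradient step
    groups -- list of index lists, assumed here NOT to overlap
    lam    -- penalty weight times step size (non-negative)
    """
    out = list(v)
    for idx in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in idx))
        # A group whose norm is at most lam is zeroed out as a whole;
        # otherwise every member is shrunk by the same factor.
        scale = 0.0 if norm <= lam else 1.0 - lam / norm
        for i in idx:
            out[i] = scale * v[i]
    return out


# Toy input: the first group has norm 5, the second has norm 0.5.
coef = prox_group_lasso([3.0, 4.0, 0.3, -0.4], [[0, 1], [2, 3]], lam=1.0)
```

With `lam=1.0` the first group is kept and shrunk by a factor of 0.8, while the second group is removed entirely. This all-or-nothing behavior per group is what yields the coordinated variable selection within a genomic region or pathway that the abstract describes.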


multiSLIDE: a web server for exploring connected elements of biological pathways in multi-omics data

10.1101/812271 ◽

2019 ◽

Author(s):

Soumita Ghosh ◽

Abhik Datta ◽

Hyungwon Choi

Keyword(s):

Keyword Search ◽

Data Sets ◽

Omics Data ◽

Web Based ◽

Molecular Features ◽

External Data ◽

Cluster Data ◽

Wide Range ◽

Time Basis ◽

Gene Ontologies

Abstract: Emerging multi-omics experiments pose new challenges for exploration of quantitative data sets. We present multiSLIDE, a web-based interactive tool for simultaneous heatmap visualization of interconnected molecular features in multi-omics data sets. multiSLIDE operates by keyword search for visualizing biologically connected molecular features, such as genes in pathways and Gene Ontologies, offering convenient functionalities to rearrange, filter, and cluster data sets in a web browser in real time. Various built-in querying mechanisms make it adaptable to diverse omics types, and visualizations are fully customizable. We demonstrate the versatility of the tool through three example studies, each of which showcases its applicability to a wide range of multi-omics data sets, its ability to visualize the links between molecules at different granularities of measurement units, and its interface for incorporating inter-molecular relationships from external data sources into the visualization. Online and standalone versions of multiSLIDE are available at https://github.com/soumitag/multiSLIDE.


Use of a General Dose–Response Model for Rockfish Fecundity–Length Relationships

Canadian Journal of Fisheries and Aquatic Sciences ◽

10.1139/f90-134 ◽

1990 ◽

Vol 47 (6) ◽

pp. 1148-1156 ◽

Cited By ~ 1

Author(s):

Laura J. Richards ◽

Jon T. Schnute

Keyword(s):

Dose Response ◽

Regression Models ◽

Data Sets ◽

Response Model ◽

Exact Inference ◽

Special Cases ◽

Wide Range ◽

Length Data ◽

The Relationship ◽

General Method

In this paper we describe a general method for determining the relationship between fecundity and another fish attribute, such as size or age. Our methods include linear and logarithmic regression models as special cases and are applicable to a wide range of situations. The model we propose is based on the univariate form of the Schnute–Jensen dose–response model. However, we extend the Schnute–Jensen analysis by describing exact inference regions obtained from likelihood contours, to which we assign nominal probability levels. We also provide a method for obtaining an inference band for the predicted curve. We examine the issue of model adequacy as it relates to fecundity–length data from two rockfish (Sebastes) species. We show that the extra complexity of our model is justified, as none of the traditional regression models are appropriate for all three of our data sets. Further, we use inference bands to distinguish fecundity–length relationships for quillback rockfish (S. maliger) from two areas, but we are unable to distinguish one of these relationships from a similar relationship for copper rockfish (S. caurinus).


multiSLIDE is a web server for exploring connected elements of biological pathways in multi-omics data

Nature Communications ◽

10.1038/s41467-021-22650-x ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Soumita Ghosh ◽

Abhik Datta ◽

Hyungwon Choi

Keyword(s):

Keyword Search ◽

Data Sets ◽

Omics Data ◽

Web Browser ◽

Web Based ◽

Molecular Features ◽

Cluster Data ◽

Wide Range ◽

Or Genes ◽

Simultaneous Visualization

Abstract: Quantitative multi-omics data are difficult to interpret and visualize due to the large volume of data, complexity among data features, and heterogeneity of information represented by different omics platforms. Here, we present multiSLIDE, a web-based interactive tool for the simultaneous visualization of interconnected molecular features in heatmaps of multi-omics data sets. multiSLIDE visualizes biologically connected molecular features by keyword search of pathways or genes, offering convenient functionalities to query, rearrange, filter, and cluster data in a web browser in real time. Various querying mechanisms make it adaptable to diverse omics types, and visualizations are customizable. We demonstrate the versatility of multiSLIDE through three examples, showcasing its applicability to a wide range of multi-omics data sets by allowing users to visualize established links between molecules from different omics data, as well as to incorporate custom inter-molecular relationship information into the visualization. Online and stand-alone versions of multiSLIDE are available at https://github.com/soumitag/multiSLIDE.


mixOmics: an R package for ‘omics feature selection and multiple data integration

10.1101/108597 ◽

2017 ◽

Cited By ~ 19

Author(s):

Florian Rohart ◽

Benoît Gautier ◽

Amrit Singh ◽

Kim-Anh Lê Cao

Keyword(s):

Data Integration ◽

Large Scale ◽

Relevant Information ◽

R Package ◽

Biological Data ◽

Molecular Signature ◽

Single Type ◽

Data Sets ◽

Omics Data ◽

Wide Range

Abstract: The advent of high throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, and metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of ‘omics data available from the package.


OmicsOne: associate omics data with phenotypes in one-click

Clinical Proteomics ◽

10.1186/s12014-021-09334-w ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Hui Zhang ◽

Minghui Ao ◽

Arianna Boja ◽

Michael Schnaubelt ◽

Yingwei Hu

Keyword(s):

Quality Control ◽

Data Analysis ◽

Association Analysis ◽

Lung Squamous Cell Carcinoma ◽

Data Sets ◽

Omics Data ◽

Data Set ◽

Cancer Data ◽

Public Data ◽

Potential Biomarkers

Abstract Background The rapid advancements of high throughput “omics” technologies have brought a massive amount of data to process during and after experiments. Multi-omic analysis facilitates a deeper interrogation of a dataset and the discovery of interesting genes, proteins, lipids, glycans, metabolites, or pathways related to the corresponding phenotypes in a study. Many individual software tools have been developed for data analysis and visualization. However, an efficient way to investigate phenotypes with multiple omics data is still lacking. Here, we present OmicsOne as an interactive web-based framework for rapid phenotype association analysis of multi-omic data by integrating quality control, statistical analysis, and interactive data visualization in ‘one click’. Materials and methods OmicsOne was applied to the previously published proteomic and glycoproteomic data sets of high-grade serous ovarian carcinoma (HGSOC) and the published proteome data set of lung squamous cell carcinoma (LSCC) to confirm its performance. The data were analyzed through six main functional modules implemented in OmicsOne: (1) phenotype profiling, (2) data preprocessing and quality control, (3) knowledge annotation, (4) discovery of phenotype-associated features, (5) correlation and regression model analysis for phenotype association analysis on individual features, and (6) enrichment analysis for phenotype association analysis on feature sets of interest. Results We developed an integrated software solution, OmicsOne, for phenotype association analysis on multi-omics data sets. The application of OmicsOne to the public ovarian cancer data set showed that the software could consistently confirm the previous observations and discover new evidence for HNRNPU and a glycopeptide of HYOU1 as potential biomarkers for HGSOC data sets. The performance of OmicsOne was further demonstrated in the Tumor and NAT comparison study on the proteome data set of LSCC.
Conclusions OmicsOne can effectively simplify data analysis and reveal the significant associations between phenotypes and potential biomarkers, including genes, proteins, and glycopeptides, in minutes to assist users to understand aberrant biological processes.


A Universal Delayed Difference Model Fitting Dose-response Curves

Dose-Response ◽

10.1177/15593258211062785 ◽

2021 ◽

Vol 19 (4) ◽

pp. 155932582110627

Author(s):

Linqian Yang ◽

Jiaying Wang ◽

Robert A. Cheke ◽

Sanyi Tang

Keyword(s):

Dose Response ◽

Model Fitting ◽

Plant Management ◽

Data Sets ◽

Difference Model ◽

Response Curves ◽

Response Models ◽

Response Data ◽

Wide Range ◽

Dose Response Curves

Purpose Dose-response curves, which fit a multitude of experimental data derived from toxicology, are widely used in physics, chemistry, biology, and other fields. Although there are many dose-response models for fitting dose-response curves, the application of these models is limited by many restrictions and lacks universality, so there is a need for a novel, universal dynamical model that can improve fits to various types of dose-response curves. Methods We expand the hormetic Ricker model, taking the delay inherent in the dose-response into account, and develop a novel and dynamic delayed Ricker difference model (DRDM) to fit various types of dose-response curves. Furthermore, we compare the DRDM with other dose-response models to confirm that it can mimic different types of dose-response curves. Data analysis By fitting various types of dose-response data sets derived from drug applications, disease treatment, pest control, and plant management, and comparing the imitative effect of the DRDM with other models, we find that the DRDM fits monotonic dose-response data well and, in most circumstances, the DRDM has a better imitative effect to non-monotonic dose-response data with hormesis than other models do. Results The MSE of fits of the DRDM to S-shaped dose-response data (DS2-G) is not lower than those for four other models, but the MSE of fits to U-shaped (DS7) and inverted U-shaped dose-response data (DS10) were lower than for two other models. This means that the imitative effect of the DRDM is comparable to other models of monotonic dose-response data, but is a significant improvement compared to traditional models of non-monotonic dose-response data with hormesis. Conclusion We propose a novel dynamic model (DRDM) for fitting to various types of dose-response curves, which can reflect the dynamic trend of the population growth compared with traditional static dose-response models. 
By analyzing data, we have confirmed that the DRDM provides an ideal description of various dose-response observations and it can be used to fit a wide range of dose-response data sets, especially for hormetic data sets. Therefore, we conclude that the DRDM has a good universality for dose-response curve fitting.


CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies

Applied and Environmental Microbiology ◽

10.1128/aem.02896-14 ◽

2014 ◽

Vol 81 (5) ◽

pp. 1573-1584 ◽

Cited By ~ 26

Author(s):

Mohamed Mysara ◽

Yvan Saeys ◽

Natalie Leys ◽

Jeroen Raes ◽

Pieter Monsieurs

Keyword(s):

16S Rrna ◽

De Novo ◽

Pcr Amplification ◽

Illumina Miseq ◽

Ensemble Classifier ◽

Marker Genes ◽

Data Sets ◽

Sequencing Data ◽

Wide Range ◽

Sequencing Studies

ABSTRACT: In ecological studies, microbial diversity is nowadays mostly assessed via the detection of phylogenetic marker genes, such as 16S rRNA. However, PCR amplification of these marker genes produces a significant amount of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. Therefore, two machine learning classifiers (reference-based and de novo CATCh) were developed by integrating the output of existing chimera detection tools into a new, more powerful method. When comparing our classifiers with existing tools in either the reference-based or de novo mode, a higher performance of our ensemble method was observed on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Since our algorithm combines the advantages of different individual chimera detection tools, our approach produces more robust results when challenged with chimeric sequences having a low parent divergence, short length of the chimeric range, and various numbers of parents. Additionally, it could be shown that integrating CATCh in the preprocessing pipeline has a beneficial effect on the quality of the clustering in operational taxonomic units.


mtDNAcombine: tools to combine sequences from multiple studies

BMC Bioinformatics ◽

10.1186/s12859-021-04048-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Eleanor F. Miller ◽

Andrea Manica

Keyword(s):

Sequence Data ◽

Data Extraction ◽

Bayesian Skyline Plot ◽

Model Organisms ◽

Data Sets ◽

Data Handling ◽

Online Database ◽

Genetic Studies ◽

Wide Range ◽

Existing Data

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.


EFFECTS OF BREED AND SEX ON THE PATTERNS OF FAT DEPOSITION AND DISTRIBUTION IN SWINE

Canadian Journal of Animal Science ◽

10.4141/cjas80-031 ◽

1980 ◽

Vol 60 (2) ◽

pp. 223-230 ◽

Cited By ~ 14

Author(s):

S. D. M. JONES ◽

R. J. RICHMOND ◽

M. A. PRICE ◽

R. B. BERG

Keyword(s):

Subcutaneous Fat ◽

Fat Distribution ◽

Allometric Equation ◽

Muscle Weight ◽

Constant Weight ◽

Independent Variables ◽

Fat Depots ◽

Wide Range ◽

Sex Type ◽

Breed Differences

The growth and distribution of fat from 163 pig carcasses were compared among five breeds (Duroc × Yorkshire (D × Y), Hampshire × Yorkshire (H × Y), Yorkshire (Y × Y), Yorkshire × Lacombe-Yorkshire (Y × L-Y) and Lacombe × Yorkshire (L × Y)) and two sex-types (barrows and gilts) over a wide range in carcass weight. The growth pattern of fat and the fat depots were estimated from the allometric equation (Y = aX^b) using side muscle weight and side fat weight separately as independent variables. Growth coefficients (b) for intermuscular and subcutaneous fat depots were similar for the hindquarter but the intermuscular depot coefficient was slightly higher for the forequarter. The coefficient for body cavity fat was highest in all comparisons. No significant differences were detected for coefficients among breeds and between sexes using both total muscle and total side fat as independent variables. Significant breed and sex-type differences were found in the fat depots at a constant weight of side muscle. This would indicate that breed differences in fatness seemed to be more influenced by the initiation of fattening at different muscle weights than by any inherent differences in rate of fattening. Significant breed differences were also found in the fat depots at a constant fat weight, indicating that breed may influence fat distribution. Sex-type had no effect on fat distribution when the evaluation was made at constant fatness.
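The allometric equation is commonly fitted by taking logarithms, which turns it into a straight line: log Y = log a + b log X. The sketch below shows that log-log least-squares fit. The function name `fit_allometric` and the numbers are hypothetical, and the data are synthetic values generated from a known curve, not measurements from the study; the abstract does not state which fitting procedure the authors used.

```python
import math


def fit_allometric(x, y):
    """Estimate a and b in Y = a * X**b by least squares on the log-log scale."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (w - mean_y) for u, w in zip(lx, ly))
    b = sxy / sxx                         # growth coefficient (slope)
    a = math.exp(mean_y - b * mean_x)     # back-transformed intercept
    return a, b


# Synthetic example: "fat" generated exactly from a = 0.5, b = 1.4.
muscle = [10.0, 15.0, 20.0, 25.0, 30.0]
fat = [0.5 * m ** 1.4 for m in muscle]
a, b = fit_allometric(muscle, fat)
```

A growth coefficient b above 1 means the depot grows proportionally faster than the reference tissue, which is how the coefficients compared in the abstract are usually read.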


MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

BMC Bioinformatics ◽

10.1186/s12859-021-04288-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yance Feng ◽

Lei M. Li

Keyword(s):

Biological Significance ◽

Housekeeping Genes ◽

R Package ◽

Data Sets ◽

Statistical Regression ◽

Rna Seq ◽

Least Trimmed Squares ◽

Standard Data ◽

Wide Range ◽

Multiple References

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common for a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. Then the pairwise intermediates are integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt the robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code under license GPL-3 is available on the GitHub platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs the RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations are used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
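To give a feel for least trimmed squares (LTS) regression in general, the sketch below fits a straight line by repeatedly refitting ordinary least squares on the h points with the smallest squared residuals. It is a simplified illustration, not MUREN's implementation: the helper names `ols_line` and `lts_line`, the single start from the full-data fit, and the toy data are all assumptions, and practical LTS solvers typically use many random starting subsets.

```python
def ols_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (w - mean_y) for u, w in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lts_line(x, y, h, max_iter=50):
    """Approximate LTS fit: keep the h best-fitting points and refit until stable."""
    keep = list(range(len(x)))            # start from the full-data fit
    for _ in range(max_iter):
        b0, b1 = ols_line([x[i] for i in keep], [y[i] for i in keep])
        # Rank every point by squared residual under the current line.
        order = sorted(range(len(x)), key=lambda i: (y[i] - b0 - b1 * x[i]) ** 2)
        new_keep = sorted(order[:h])
        if new_keep == keep:              # retained subset no longer changes
            break
        keep = new_keep
    return b0, b1


# Toy data: y = 2x + 1 with three points shifted upward by 50.
xs = [float(i) for i in range(20)]
ys = [2.0 * v + 1.0 for v in xs]
for i in (3, 10, 16):
    ys[i] += 50.0
intercept, slope = lts_line(xs, ys, h=15)
```

On this toy input the three shifted points fall outside the retained subset after the first pass, so the final line is determined by unshifted points only. In a normalization setting the trimmed points would correspond to genes whose expression differs between the two samples, leaving the stable, housekeeping-like genes to define the fit.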
