STATegra: Multi-Omics Data Integration – A Conceptual Scheme With a Bioinformatics Pipeline

2021 ◽  
Vol 12 ◽  
Author(s):  
Nuria Planell ◽  
Vincenzo Lagani ◽  
Patricia Sebastian-Leon ◽  
Frans van der Kloet ◽  
Ewoud Ewing ◽  
...  

Technologies for profiling samples using different omics platforms have been at the forefront since the Human Genome Project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. Therefore, it is an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), aiming for it to be as generic as possible for multi-omics analysis, combining available multi-omics analysis tools (machine learning component analysis, non-parametric data combination, and a multi-omics exploratory analysis) in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two case studies from The Cancer Genome Atlas (TCGA). For both the Glioblastoma and the Skin Cutaneous Melanoma (SKCM) cases, we demonstrate an enhanced capacity of the framework (beyond the individual tools) to identify features and pathways compared to single-omics analysis. Such an integrative multi-omics analysis framework for identifying features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step by step as OpenSource in the STATegRa Bioconductor package.
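The abstract describes a step-wise combination of component analysis, non-parametric data combination, and exploratory analysis. STATegRa itself is distributed as an R/Bioconductor package, so the Python sketch below only illustrates the general step-wise idea on toy data; the t-tests, the Fisher/permutation combination, the feature mapping, and all variable names are illustrative assumptions, not the package's API.

```python
# Illustrative sketch only: a generic step-wise multi-omics integration in Python.
# Not the STATegRa Bioconductor package; all choices below are assumptions.
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data: two omics layers measured on the same 40 samples (20 vs 20 groups).
n, p1, p2 = 40, 200, 150
groups = np.array([0] * 20 + [1] * 20)
omics1 = rng.normal(size=(n, p1))          # e.g., mRNA expression
omics2 = rng.normal(size=(n, p2))          # e.g., methylation

# Step 1: exploratory joint component analysis on the concatenated, scaled layers.
joint = np.hstack([StandardScaler().fit_transform(x) for x in (omics1, omics2)])
scores = PCA(n_components=2).fit_transform(joint)

# Step 2: per-omics feature statistics (here a simple two-sample t-test).
def per_feature_pvalues(x):
    _, p = stats.ttest_ind(x[groups == 0], x[groups == 1], axis=0)
    return p

p1_vals, p2_vals = per_feature_pvalues(omics1), per_feature_pvalues(omics2)

# Step 3: non-parametric combination for features measured in both layers
# (assume the first 100 features of each layer map to the same genes).
fisher_obs = -2 * (np.log(p1_vals[:100]) + np.log(p2_vals[:100]))
perm_max = []
for _ in range(200):  # max-type permutation null for adjusted combined p-values
    perm = rng.permutation(groups)
    q1 = stats.ttest_ind(omics1[perm == 0, :100], omics1[perm == 1, :100], axis=0)[1]
    q2 = stats.ttest_ind(omics2[perm == 0, :100], omics2[perm == 1, :100], axis=0)[1]
    perm_max.append((-2 * (np.log(q1) + np.log(q2))).max())
combined_p = (1 + np.sum(np.array(perm_max)[:, None] >= fisher_obs, axis=0)) / 201
print("top combined features:", np.argsort(combined_p)[:5])
```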

2020 ◽  
Author(s):  
Nuria Planell ◽  
Vincenzo Lagani ◽  
Patricia Sebastian-Leon ◽  
Frans van der Kloet ◽  
Ewoud Ewing ◽  
...  

Technologies for profiling samples using different omics platforms have been at the forefront since the Human Genome Project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. It is therefore an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), aiming for it to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two TCGA case studies. For both the Glioblastoma and the Skin Cutaneous Melanoma cases, we demonstrate an enhanced capacity to identify features in comparison to single-omics analysis. Such an integrative multi-omics analysis framework for the identification of features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled, and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step by step as OpenSource in the STATegRa Bioconductor package https://bioconductor.org/packages/release/bioc/html/STATegra.html.


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 1968 ◽  
Author(s):  
Roderic Guigo ◽  
Michiel de Hoon

At the beginning of this century, the Human Genome Project produced the first drafts of the human genome sequence. Following this, large-scale functional genomics studies were initiated to understand the molecular basis underlying the translation of the instructions encoded in the genome into the biological traits of organisms. Instrumental in the ensuing revolution in functional genomics were the rapid advances in massively parallel sequencing technologies as well as the development of a wide diversity of protocols that make use of these technologies to understand cellular behavior at the molecular level. Here, we review recent advances in functional genomic methods, discuss some of their current capabilities and limitations, and briefly sketch future directions within the field.


2021 ◽  
Author(s):  
Khandakar Tanvir Ahmed ◽  
Jiao Sun ◽  
Jeongsik Yong ◽  
Wei Zhang

Accurate disease phenotype prediction plays an important role in the treatment of heterogeneous diseases like cancer in the era of precision medicine. With the advent of high-throughput technologies, more comprehensive multi-omics data are now available that can effectively link genotype to phenotype. However, the interactive relations among multi-omics datasets make it particularly challenging to incorporate different biological layers to discover coherent biological signatures and predict phenotypic outcomes. In this study, we introduce omicsGAN, a generative adversarial network (GAN) model to integrate two omics datasets and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuses them to generate synthetic data with better predictive signals. Large-scale experiments on The Cancer Genome Atlas (TCGA) breast cancer and ovarian cancer datasets validate that (1) the model can effectively integrate two omics datasets (i.e., mRNA and microRNA expression data) and their interaction network (i.e., the microRNA-mRNA interaction network); the synthetic omics data generated by the proposed model perform better on cancer outcome classification and patient survival prediction than the original omics datasets. (2) The integrity of the interaction network plays a vital role in the generation of synthetic data with higher predictive quality: using a random interaction network does not allow the framework to learn meaningful information from the omics datasets and therefore results in synthetic data with weaker predictive signals.
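As a rough illustration of the idea described in this abstract, the minimal sketch below trains a generator that fuses one omics layer with a (toy) mRNA-microRNA interaction matrix to synthesize the other layer, while a discriminator judges real versus synthetic profiles. This is not the published omicsGAN architecture; the layer sizes, the concatenation-based fusion, and the training schedule are assumptions made for the example.

```python
# Hedged sketch of a network-informed two-omics GAN; not the authors' exact model.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, p_mrna, p_mirna = 64, 500, 100
mrna = torch.randn(n, p_mrna)                       # toy mRNA expression
mirna = torch.randn(n, p_mirna)                     # toy microRNA expression
adj = (torch.rand(p_mrna, p_mirna) < 0.05).float()  # toy interaction network

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(p_mrna + p_mirna, 256), nn.ReLU(),
                                 nn.Linear(256, p_mirna))
    def forward(self, x, a):
        # Fuse the mRNA profile with its network-propagated projection onto miRNA space.
        return self.net(torch.cat([x, x @ a], dim=1))

gen = Generator()
disc = nn.Sequential(nn.Linear(p_mirna, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for _ in range(200):
    fake = gen(mrna, adj)
    # Discriminator step: distinguish real microRNA profiles from synthetic ones.
    d_loss = bce(disc(mirna), torch.ones(n, 1)) + bce(disc(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: produce synthetic profiles the discriminator accepts as real.
    g_loss = bce(disc(gen(mrna, adj)), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic_mirna = gen(mrna, adj).detach()  # downstream: feed to a phenotype classifier
```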


2006 ◽  
Vol 3 (1) ◽  
pp. 45-55
Author(s):  
P. Romano ◽  
G. Bertolini ◽  
F. De Paoli ◽  
M. Fattore ◽  
D. Marra ◽  
...  

The Human Genome Project has deeply transformed biology, and the field has since expanded to the management, processing, analysis and visualization of large quantities of data from genomics, proteomics, medicinal chemistry and drug screening. This huge amount of data, and the heterogeneity of the software tools in use, calls for the very large-scale adoption of new, flexible tools that enable researchers to integrate data and analyses over the network. ICT standards and tools, like Web Services and related languages, and workflow management systems, can support the creation and deployment of such systems. While a number of Web Services are appearing and personal workflow management systems are increasingly being offered to researchers, a reference portal that enables the vast majority of non-specialist researchers to benefit from these new technologies is still lacking. In this paper, we introduce the rationale for the creation of such a portal and present the architecture and some preliminary results for the development of a portal for the enactment of workflows of interest in oncology.


Author(s):  
Debra J. H. Mathews

Public health genetics (more commonly referred to as “community genetics” in Europe) has been practiced to some degree in the West since at least the 1960s, but the development of a cohesive field took time and advances in technology. The application of genetics and genomics to prevent disease and promote public health became firmly established as a field in the late 1990s, as large-scale sequencing of the human genome as part of the Human Genome Project began. The field is now thriving, leading to both tremendous public health benefits and risks for both individuals and populations. This chapter provides an overview of the section of The Oxford Handbook of Public Health Ethics dedicated to public health genetics. The chapters roughly trace the evolution of public health genetics from its roots in eugenics, to the present challenges faced in newborn screening and biobanking, and finally to emerging questions raised by the application of genomics to infectious disease.


Author(s):  
R. Edward Freeman ◽  
Pia Ahmad ◽  
Will Truslow

The case lays out the controversies surrounding the Human Genome Project. The ability to generate any individual's genetic profile raises important legal, social, and ethical questions. How should these issues be addressed, and how can the rights of the individual be protected? This case can be used to portray problems in the biotechnology industry and confidentiality in the health care industry, as well as the progress of technology and the ability of the present legal and medical systems to deal with it.


2021 ◽  
pp. 13-36
Author(s):  
Christopher L. Cummings ◽  
Kaitlin M. Volk ◽  
Anna A. Ulanova ◽  
Do Thuy Uyen Ha Lam ◽  
Pei Rou Ng

The field of biotechnology has been rigorously researched and applied to many facets of everyday life. Biotechnology is defined as the process of modifying an organism or a biological system for an intended purpose. Biotechnology applications range from agricultural crop selection to pharmaceutical and genetic processes (Bauer and Gaskell 2002). The definition, however, is evolving with recent scientific advancements. Until World War II, biotechnology was primarily siloed in agricultural biology and chemical engineering. The results of this era included disease-resistant crops, pesticides, and other pest-controlling tools (Verma et al. 2011). After WWII, biotechnology began to shift domains as advanced research on human genetics and DNA started. In 1984, the Human Genome Project (HGP) was formally proposed, initiating the pursuit by the private and academic sectors to decode the human genome. The legacy of the project gave rise to ancillary advancements in data sharing and open-source software, and solidified the prominence of "big science": capital-intensive, large-scale private-public research initiatives that were once primarily under the purview of government-funded programs (Hood and Rowen 2013). After the HGP, the biotechnology industry boomed as a result of dramatic cost reductions in DNA sequencing. In 2019, the industry was estimated to be worth $449.06 billion globally and is projected to increase in value (Polaris 2020).


2016 ◽  
Author(s):  
Ruth Heller ◽  
Nilanjan Chatterjee ◽  
Abba Krieger ◽  
Jianxin Shi

In many genomic applications, hypothesis tests are performed by aggregating test statistics across units within naturally defined classes for powerful identification of signals. Following class-level testing, it is naturally of interest to identify the lower-level units that contain true signals. Testing the individual units within a class without taking into account the fact that the class was selected using an aggregate-level test statistic will produce biased inference. We develop a hypothesis testing framework that guarantees control of false positive rates conditional on the fact that the class was selected. Specifically, we develop procedures for calculating unit-level p-values that allow rejection of null hypotheses while controlling two types of conditional error rates, one related to the family-wise error rate and the other to the false discovery rate. We use simulation studies to illustrate the validity and power of the proposed procedure in comparison to several possible alternatives. We illustrate the power of the method in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types using data from The Cancer Genome Atlas (TCGA) Project.
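To make the two-stage logic concrete, the sketch below aggregates unit-level p-values into class-level tests, selects classes, and then tests units only inside selected classes at a selection-adjusted level. It is a generic hierarchical-testing illustration (Simes aggregation plus Benjamini-Hochberg, in the spirit of Benjamini-Bogomolov), not the authors' exact conditional procedure; the toy data and thresholds are assumptions.

```python
# Simplified two-stage (class-then-unit) testing sketch; illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def simes(pvals):
    """Simes combination of unit-level p-values into one class-level p-value."""
    p = np.sort(pvals)
    m = len(p)
    return np.min(m * p / np.arange(1, m + 1))

def bh_reject(pvals, alpha):
    """Benjamini-Hochberg: boolean mask of rejected hypotheses at level alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[: passed.max() + 1]] = True
    return reject

# Toy data: 50 classes of 20 units each; the first 5 classes carry real signal.
classes = [rng.uniform(size=20) for _ in range(50)]
for c in classes[:5]:
    c[:8] = rng.uniform(0, 1e-4, size=8)

alpha = 0.05
class_p = np.array([simes(c) for c in classes])
selected = bh_reject(class_p, alpha)               # stage 1: which classes to follow up
adj_alpha = alpha * selected.sum() / len(classes)  # stage 2: selection-adjusted level
for i in np.nonzero(selected)[0]:
    units = np.nonzero(bh_reject(classes[i], adj_alpha))[0]
    print(f"class {i}: rejected units {units.tolist()}")
```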


Author(s):  
Wolfgang Wurst ◽  
Achim Gossler

Gene trap (GT) strategies in mouse embryonic stem (ES) cells are increasingly being used for detecting patterns of gene expression (1-4), isolating and mutating endogenous genes (5-7), and identifying targets of signalling molecules and transcription factors (3, 8-10). The general term gene trap refers to the random integration of a reporter gene construct (called an entrapment vector) (11, 12) into the genome such that ‘productive’ integration events bring the reporter gene under the transcriptional regulation of an endogenous gene. In some cases this also simultaneously generates an insertional mutation. Entrapment vectors were originally developed in bacteria (13), and applied in Drosophila to identify novel developmental genes and/or regulatory sequences (14-17). Subsequently, a modified strategy was developed for the mouse in which the reporter gene mRNA becomes fused to an endogenous transcript. Such ‘gene trap’ vectors were initially used primarily as a tool to discover genes involved in development (1, 2, 18). In the last five years there has been a significant shift of GT approaches in the mouse to much broader, large-scale applications in the context of the analysis of mammalian genomes and ‘functional genomics’. Sequencing and physical mapping of both the human and mouse genomes are expected to be completed within the next five years. Already, a large number of mouse and human genes have been identified as expressed sequence tags (ESTs), and very likely the majority of genes will shortly be discovered as ESTs. This vast sequence information contrasts with a rather limited understanding of the in vivo functions of these genes. Whereas DNA sequence can provide some indication of the potential functions of these genes and their products, their physiological roles in the organism have to be determined by mutational analysis. Thus, the sequencing effort of the Human Genome Project has to be complemented by efficient functional analyses of the identified genes. One potentially powerful complement to the efforts of the Human Genome Project would be a strategy whereby large-scale random mutagenesis in the mouse is combined with rapid identification of the mutated genes (6, 7, 19, and German gene trap consortium, W. W., unpublished data).

