Compartment and hub definitions tune metabolic networks for metabolomic interpretations

GigaScience ◽  
2020 ◽  
Vol 9 (1) ◽  
Author(s):  
T Cameron Waller ◽  
Jordan A Berg ◽  
Alexander Lex ◽  
Brian E Chapman ◽  
Jared Rutter

Background: Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments.
Results: We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications.
Conclusions: Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.
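For readers who want to experiment with the idea, the toy sketch below (networkx, an invented five-reaction list, and an assumed hub set; not the authors' curated human model or tooling) illustrates how excluding hub metabolites changes the size, density, and modularity of a metabolite network:

```python
# Minimal sketch: link every substrate to every product of each reaction,
# then compare network statistics with and without hub metabolites.
# The reaction list and hub set are illustrative placeholders.
import networkx as nx
from networkx.algorithms import community

reactions = [
    (["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]),
    (["glucose-6-phosphate"], ["fructose-6-phosphate"]),
    (["fructose-6-phosphate", "ATP"], ["fructose-1,6-bisphosphate", "ADP"]),
    (["citrate", "H2O"], ["isocitrate"]),
    (["isocitrate", "NAD+"], ["alpha-ketoglutarate", "NADH"]),
]
hubs = {"ATP", "ADP", "H2O", "NAD+", "NADH"}   # assumed hub list for illustration

def metabolite_network(include_hubs=True):
    """Build a metabolite-metabolite graph, optionally dropping hub metabolites."""
    g = nx.Graph()
    for substrates, products in reactions:
        for s in substrates:
            for p in products:
                if include_hubs or (s not in hubs and p not in hubs):
                    g.add_edge(s, p)
    return g

for include_hubs in (True, False):
    g = metabolite_network(include_hubs)
    parts = community.greedy_modularity_communities(g)
    print(f"hubs={include_hubs}: nodes={g.number_of_nodes()}, "
          f"density={nx.density(g):.3f}, "
          f"modularity={community.modularity(g, parts):.3f}")
```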

2019 ◽  
Vol 30 (19) ◽  
pp. 2435-2438 ◽  
Author(s):  
Jonah Cool ◽  
Richard S. Conroy ◽  
Sean E. Hanlon ◽  
Shannon K. Hughes ◽  
Ananda L. Roy

Improvements in the sensitivity, content, and throughput of microscopy, in the depth and throughput of single-cell sequencing approaches, and in computational and modeling tools for data integration have created a portfolio of methods for building spatiotemporal cell atlases. Challenges in this fast-moving field include optimizing experimental conditions to allow a holistic view of tissues, extending molecular analysis across multiple timescales, and developing new tools for 1) managing large data sets, 2) extracting patterns and correlation from these data, and 3) integrating and visualizing data and derived results in an informative way. The utility of these tools and atlases for the broader scientific community will be accelerated through a commitment to findable, accessible, interoperable, and reusable data and tool sharing principles that can be facilitated through coordination and collaboration between programs working in this space.


2001 ◽  
Vol 79 (7) ◽  
pp. 1209-1231 ◽  
Author(s):  
Rich Mooi

The fossil record of the Echinodermata is relatively complete, and is represented by specimens retaining an abundance of features comparable to that found in extant forms. This yields a half-billion-year record of evolutionary novelties unmatched in any other major group, making the Echinodermata a primary target for studies of biological change. Not all of this change can be understood by studying the rocks alone, leading to synthetic research programs. Study of literature from the past 20 years indicates that over 1400 papers on echinoderm paleontology appeared in that time, and that overall productivity has remained almost constant. Analysis of papers appearing since 1990 shows that research is driven by new finds including, but not restricted to, possible Precambrian echinoderms, bizarre new edrioasteroids, early crinoids, exquisitely preserved homalozoans, echinoids at the K-T boundary, and Antarctic echinoids, stelleroids, and crinoids. New interpretations of echinoderm body wall homologies, broad-scale syntheses of embryological information, the study of developmental trajectories through molecular markers, and the large-scale ecological and phenotypic shifts being explored through morphometry and analyses of large data sets are integrated with study of the fossils themselves. Therefore, recent advances reveal a remarkable and continuing synergistic expansion in our understanding of echinoderm evolutionary history.


2020 ◽  
Vol 20 (6) ◽  
pp. 5-17
Author(s):  
Hrachya Astsatryan ◽  
Aram Kocharyan ◽  
Daniel Hagimont ◽  
Arthur Lalayan

The optimization of large-scale data-set processing depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed over several machines. Data compression reduces data size and transfer time between disk and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system for selecting the compression tool and tuning the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
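As a hedged illustration of the tradeoff (not the paper's simulation-based system), the PySpark sketch below times the same shuffle-heavy placeholder workload under different values of the standard spark.io.compression.codec setting:

```python
# Sketch: compare wall-clock time of a shuffle-heavy job under different
# I/O compression codecs to explore the CPU vs. I/O tradeoff.
# The workload is a placeholder; codec names are standard Spark options.
import time
from pyspark.sql import SparkSession

def run_with_codec(codec):
    spark = (SparkSession.builder
             .appName(f"compression-tradeoff-{codec}")
             .config("spark.io.compression.codec", codec)  # lz4, snappy, zstd
             .getOrCreate())
    start = time.time()
    # Placeholder workload: a wide aggregation that forces a shuffle.
    spark.range(0, 50_000_000) \
         .selectExpr("id % 1000 AS key", "id AS value") \
         .groupBy("key").sum("value").collect()
    elapsed = time.time() - start
    spark.stop()
    return elapsed

for codec in ["lz4", "snappy", "zstd"]:
    print(codec, f"{run_with_codec(codec):.1f}s")
```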


2017 ◽  
Vol 33 (1) ◽  
pp. 61-77 ◽  
Author(s):  
Michele D’Aló ◽  
Stefano Falorsi ◽  
Fabrizio Solari

Most of the important large-scale surveys carried out by national statistical institutes are repeated surveys, typically intended to produce estimates for several parameters of the whole population as well as for parameters related to some subpopulations. Small area estimation techniques are becoming more and more important for the production of official statistics where direct estimators are not able to produce reliable estimates. In order to exploit data from different survey cycles, unit-level linear mixed models with area and time random effects can be considered. However, the large amount of data to be processed may cause computational problems. To overcome these computational issues, a reformulation of the predictors and the corresponding mean cross-product estimator is given. R code based on the new formulation can process about 7.2 million data records in a matter of minutes.
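The sketch below is a minimal Python analogue (statsmodels, simulated data) of a unit-level random-intercept model of the kind described; it omits the time random effects and the reformulated mean cross-product estimator that are the paper's contribution:

```python
# Illustrative sketch: unit-level model with area random intercepts and
# EBLUP-style small area predictions (fixed part + predicted area effect).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_areas, n_units = 50, 200
data = pd.DataFrame({
    "area": np.repeat(np.arange(n_areas), n_units),
    "x": rng.normal(size=n_areas * n_units),
})
area_effect = rng.normal(scale=0.5, size=n_areas)[data["area"]]
data["y"] = 2.0 + 1.5 * data["x"] + area_effect + rng.normal(size=len(data))

model = smf.mixedlm("y ~ x", data, groups=data["area"]).fit()

area_means = data.groupby("area")["x"].mean()
fixed = model.params["Intercept"] + model.params["x"] * area_means
random = pd.Series({a: re.iloc[0] for a, re in model.random_effects.items()})
print((fixed + random).head())   # small area predictions
```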


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 1006 ◽  
Author(s):  
N. Tessa Pierce ◽  
Luiz Irber ◽  
Taylor Reiter ◽  
Phillip Brooks ◽  
C. Titus Brown

The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
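A short sketch of the sourmash Python API (assuming sourmash and its screed dependency are installed; file names are placeholders) shows the sketch-then-compare workflow the abstract describes:

```python
# Sketch: build scaled MinHash "signatures" from FASTA files and estimate
# their Jaccard similarity. The command-line interface offers the same steps.
import screed          # FASTA/FASTQ parsing, a sourmash dependency
import sourmash

def signature_from_fasta(path, ksize=31, scaled=1000):
    """Build a scaled MinHash sketch from a FASTA file."""
    mh = sourmash.MinHash(n=0, ksize=ksize, scaled=scaled)
    for record in screed.open(path):
        mh.add_sequence(record.sequence, force=True)  # force skips non-ACGT k-mers
    return mh

mh1 = signature_from_fasta("genome1.fa")   # placeholder paths
mh2 = signature_from_fasta("genome2.fa")
print("estimated Jaccard similarity:", mh1.jaccard(mh2))
```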


2008 ◽  
Vol 08 (02) ◽  
pp. 243-263 ◽  
Author(s):  
BENJAMIN A. AHLBORN ◽  
OLIVER KREYLOS ◽  
SOHAIL SHAFII ◽  
BERND HAMANN ◽  
OLIVER G. STAADT

We introduce a system that adds a foveal inset to large-scale projection displays. The effective resolution of the foveal inset projection is higher than the original display resolution, allowing the user to see more details and finer features in large data sets. The foveal inset is generated by projecting a high-resolution image onto a mirror mounted on a pan-tilt unit that is controlled by the user with a laser pointer. Our implementation is based on Chromium and supports many OpenGL applications without modifications. We present experimental results using high-resolution image data from medical imaging and aerial photography.
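Illustrative only: the numpy sketch below builds an off-axis (asymmetric) frustum for an inset covering a sub-rectangle of the full display frustum, the kind of projection setup a high-resolution foveal inset needs; the parameter names and helper are assumptions, not the paper's implementation:

```python
# Sketch: OpenGL-style asymmetric perspective matrix for a foveal inset
# centred where a pointer hits the screen (normalised coordinates u, v).
import numpy as np

def frustum(l, r, b, t, n, f):
    """Perspective matrix for the frustum [l,r]x[b,t] on the near plane n."""
    return np.array([
        [2*n/(r-l), 0,          (r+l)/(r-l),   0],
        [0,         2*n/(t-b),  (t+b)/(t-b),   0],
        [0,         0,         -(f+n)/(f-n),  -2*f*n/(f-n)],
        [0,         0,         -1,             0],
    ])

# Full-display frustum and an inset covering 20% of its width and height.
L, R, B, T, N, F = -1.0, 1.0, -0.75, 0.75, 1.0, 100.0
u, v, size = 0.6, 0.4, 0.2
cx, cy = L + u * (R - L), B + v * (T - B)
half_w, half_h = size * (R - L) / 2, size * (T - B) / 2
inset = frustum(cx - half_w, cx + half_w, cy - half_h, cy + half_h, N, F)
print(inset)
```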


2013 ◽  
Vol 7 (1) ◽  
pp. 19-24
Author(s):  
Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. At times, where the end goal is to look for relationships between (or patterns within) different subgroups or even just individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. As an example, in anthropological microarray studies, such ‘dimension reduction’ techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the larger data set. For example, polling results taken during elections are used to infer the opinions of the population at large. However, what is the best and easiest method of capturing a subset of the variation in a data set that can represent the overall portrait of variation? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data sets. New methods of analysis using PCA are also suggested, with tentative results outlined.
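A minimal sketch, assuming scikit-learn and a synthetic genotype-like matrix in place of real microarray data, of PCA as the dimension-reduction step described above:

```python
# Sketch: reduce a samples-by-markers matrix to its top principal components
# and inspect how much variation each component captures.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(500, 10_000)).astype(float)  # samples x markers

pca = PCA(n_components=10)
scores = pca.fit_transform(X)          # sample coordinates on the top 10 PCs
print("variance explained:", pca.explained_variance_ratio_.round(3))
print("first two PCs of first sample:", scores[0, :2])
```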


Author(s):  
Nathan T Weeks ◽  
Glenn R Luecke ◽  
Brandon M Groth ◽  
Marina Kraeva ◽  
Li Ma ◽  
...  

epiSNP is a program for identifying pairwise single nucleotide polymorphism (SNP) interactions (epistasis) in quantitative-trait genome-wide association studies (GWAS). A parallel MPI version (EPISNPmpi) was created in 2008 to address this computationally expensive analysis on large data sets with many quantitative traits and SNP markers. However, the falling cost of genotyping has led to an explosion of large-scale GWAS data sets that challenge EPISNPmpi’s ability to compute results in a reasonable amount of time. Therefore, we optimized epiSNP for modern multi-core and highly parallel many-core processors to efficiently handle these large data sets. This paper describes the serial optimizations, dynamic load balancing using MPI-3 RMA operations, and shared-memory parallelization with OpenMP to further enhance load balancing and allow execution on the Intel Xeon Phi coprocessor (MIC). For a large GWAS data set, our optimizations provided a 38.43× speedup over EPISNPmpi on 126 nodes using 2 MICs on TACC’s Stampede Supercomputer. We also describe a Coarray Fortran (CAF) version that demonstrates the suitability of PGAS languages for problems with this computational pattern. We show that the Coarray version performs competitively with the MPI version on the NERSC Edison Cray XC30 supercomputer. Finally, the performance benefits of hyper-threading for this application on Edison (average 1.35× speedup) are demonstrated.
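The sketch below shows only the computational pattern (an exhaustive pairwise interaction scan over SNPs), not epiSNP's statistics, parallelization, or load balancing; the data are simulated and tiny:

```python
# Sketch: for each SNP pair, fit y ~ snp_i + snp_j + snp_i*snp_j and keep
# the p-value of the interaction term. Real GWAS scans need the kind of
# parallel load balancing the paper describes.
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_samples, n_snps = 500, 20
genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
trait = 0.5 * genotypes[:, 3] * genotypes[:, 7] + rng.normal(size=n_samples)

results = []
for i, j in combinations(range(n_snps), 2):            # O(n_snps^2) pairs
    gi, gj = genotypes[:, i], genotypes[:, j]
    X = sm.add_constant(np.column_stack([gi, gj, gi * gj]))
    fit = sm.OLS(trait, X).fit()
    results.append((fit.pvalues[3], i, j))              # interaction term p-value

print(sorted(results)[:3])   # most significant SNP pairs
```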

