A brief guide to computer intensive statistics

Author(s):  
Odo Diekmann ◽  
Hans Heesterbeek ◽  
Tom Britton

Chapters 5, 13 and 14 presented methods for making inference about infectious diseases from available data. This is of course one of the main motivations for modeling: learning about important features, such as R₀, the initial growth rate, potential outbreak sizes and what effect different control measures might have in the context of specific infections. The models considered in these chapters have all been simple enough to obtain more or less explicit estimates of just a few relevant parameters. In more complicated and parameter-rich models, and/or when analyzing large data sets, it is usually impossible to estimate key model parameters explicitly. In such situations there are (at least) two ways to proceed. One uses Bayesian statistical inference by means of Markov chain Monte Carlo (MCMC) methods, and the other uses large-scale simulations along with numerical optimization to fit parameters to data. This chapter mainly describes Bayesian inference using MCMC, and only briefly touches on some large-scale simulation methods.
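As a rough, hypothetical illustration of the kind of MCMC computation the chapter has in mind (not the book's own algorithm or data), the sketch below runs a random-walk Metropolis sampler for a single transmission probability, assuming an invented observation of 37 infections among 100 exposed individuals and a flat Beta(1, 1) prior.

```python
import numpy as np

# Minimal random-walk Metropolis sampler for a single transmission
# probability theta, given a hypothetical observation: k infections
# among n exposed individuals (binomial likelihood, flat prior).
rng = np.random.default_rng(1)
n, k = 100, 37          # invented data, for illustration only

def log_posterior(theta):
    if not 0.0 < theta < 1.0:
        return -np.inf   # outside the prior's support
    # Binomial log-likelihood; the flat prior contributes only a constant
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

theta = 0.5              # starting value
samples = []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.05)   # random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

burned = np.array(samples[5000:])              # discard burn-in
print(f"posterior mean {burned.mean():.3f}, 95% interval "
      f"({np.quantile(burned, 0.025):.3f}, {np.quantile(burned, 0.975):.3f})")
```

The posterior summary printed at the end plays the role that an explicit estimator would play in the simpler models of the earlier chapters.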


GigaScience ◽  
2020 ◽  
Vol 9 (1) ◽  
Author(s):  
T Cameron Waller ◽  
Jordan A Berg ◽  
Alexander Lex ◽  
Brian E Chapman ◽  
Jared Rutter

Abstract
Background: Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments.
Results: We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications.
Conclusions: Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.
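As a toy illustration of the hub effect described above (with made-up metabolite names, not the study's curated model or tools), the networkx sketch below contrasts shortest paths and community structure with and without two hub-like metabolites.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy metabolite graph: edges link metabolites that share a reaction.
# "atp" and "h2o" stand in for prolific hub metabolites.
edges = [
    ("glucose", "g6p"), ("g6p", "f6p"), ("f6p", "fbp"),
    ("citrate", "isocitrate"), ("isocitrate", "akg"), ("akg", "succinate"),
    # hub metabolites touch nearly everything
    ("atp", "glucose"), ("atp", "g6p"), ("atp", "fbp"),
    ("atp", "citrate"), ("atp", "akg"), ("h2o", "f6p"),
    ("h2o", "isocitrate"), ("h2o", "succinate"),
]
full = nx.Graph(edges)

# With hubs present, shortest paths are dominated by them.
print(nx.shortest_path(full, "glucose", "succinate"))   # routes through a hub

# Excluding hubs exposes the pathway-like modular structure.
no_hubs = full.copy()
no_hubs.remove_nodes_from(["atp", "h2o"])
for community in greedy_modularity_communities(no_hubs):
    print(sorted(community))
```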


2001 ◽  
Vol 79 (7) ◽  
pp. 1209-1231 ◽  
Author(s):  
Rich Mooi

The fossil record of the Echinodermata is relatively complete, and is represented by specimens retaining an abundance of features comparable to that found in extant forms. This yields a half-billion-year record of evolutionary novelties unmatched in any other major group, making the Echinodermata a primary target for studies of biological change. Not all of this change can be understood by studying the rocks alone, leading to synthetic research programs. Study of literature from the past 20 years indicates that over 1400 papers on echinoderm paleontology appeared in that time, and that overall productivity has remained almost constant. Analysis of papers appearing since 1990 shows that research is driven by new finds including, but not restricted to, possible Precambrian echinoderms, bizarre new edrioasteroids, early crinoids, exquisitely preserved homalozoans, echinoids at the K-T boundary, and Antarctic echinoids, stelleroids, and crinoids. New interpretations of echinoderm body wall homologies, broad-scale syntheses of embryological information, the study of developmental trajectories through molecular markers, and the large-scale ecological and phenotypic shifts being explored through morphometry and analyses of large data sets are integrated with study of the fossils themselves. Therefore, recent advances reveal a remarkable and continuing synergistic expansion in our understanding of echinoderm evolutionary history.


2020 ◽  
Vol 20 (6) ◽  
pp. 5-17
Author(s):  
Hrachya Astsatryan ◽  
Aram Kocharyan ◽  
Daniel Hagimont ◽  
Arthur Lalayan

Abstract: Optimizing the processing of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed over several machines. Data compression reduces data size and transfer time between disk and memory, but requires additional processing. Finding an optimal tradeoff is therefore a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system that selects the compression tools and tunes the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
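The paper's own selection system is not reproduced here, but as a rough illustration of the kind of knobs being tuned, a PySpark session can switch the I/O compression codec and its level; the configuration keys below are standard Spark settings, while the job itself is an arbitrary shuffle-heavy placeholder.

```python
from pyspark.sql import SparkSession

# Illustrative only: two of the knobs behind the tradeoff discussed above.
# "zstd" trades more CPU for smaller shuffle/spill files; "lz4" (Spark's
# default) is lighter on the processor but writes more bytes to disk.
spark = (
    SparkSession.builder
    .appName("compression-tradeoff-demo")
    .config("spark.io.compression.codec", "zstd")      # codec selection
    .config("spark.io.compression.zstd.level", "3")    # compression factor
    .config("spark.shuffle.compress", "true")          # compress shuffle output
    .getOrCreate()
)

# Any shuffle-heavy job now uses the chosen codec; timing runs like this
# under different codecs and levels is one way to locate the sweet spot.
df = spark.range(10_000_000)
(df.groupBy((df.id % 1000).alias("bucket"))
   .count()
   .write.mode("overwrite")
   .parquet("/tmp/bucket_counts"))
```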


2019 ◽  
Author(s):  
N. Tessa Pierce ◽  
Luiz Irber ◽  
Taylor Reiter ◽  
Phillip Brooks ◽  
C. Titus Brown

The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
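The snippet below is not sourmash's actual API; it is a bare-bones, scaled-MinHash-style sketch in plain Python that shows why such signatures let similarity be estimated without keeping the full sequences around.

```python
import hashlib
import random

def kmer_hashes(seq, ksize=21):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize].encode()
        yield int.from_bytes(hashlib.blake2b(kmer, digest_size=8).digest(), "big")

def sketch(seq, ksize=21, scaled=10):
    """Keep only hashes below max_hash/scaled -- a small 'signature'."""
    threshold = 2**64 // scaled
    return {h for h in kmer_hashes(seq, ksize) if h < threshold}

def jaccard(sig_a, sig_b):
    """Estimate sequence similarity from the two sketches alone."""
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)

# Two synthetic sequences that share their first 1,500 bases.
random.seed(0)
a = "".join(random.choice("ACGT") for _ in range(2000))
b = a[:1500] + "".join(random.choice("ACGT") for _ in range(500))
print(f"estimated Jaccard similarity: {jaccard(sketch(a), sketch(b)):.2f}")
```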


2017 ◽  
Vol 33 (1) ◽  
pp. 61-77 ◽  
Author(s):  
Michele D’Aló ◽  
Stefano Falorsi ◽  
Fabrizio Solari

Abstract: Most important large-scale surveys carried out by national statistical institutes are of the repeated type, typically intended to produce estimates for several parameters of the whole population, as well as parameters related to some subpopulations. Small area estimation techniques are becoming more and more important for the production of official statistics where direct estimators are not able to produce reliable estimates. In order to exploit data from different survey cycles, unit-level linear mixed models with area and time random effects can be considered. However, the large amount of data to be processed may cause computational problems. To overcome the computational issues, a reformulation of the predictors and the corresponding mean cross product estimator is given. The R code based on the new formulation enables the processing of about 7.2 million data records in a matter of minutes.
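The paper's reformulated predictors and R code are not reproduced here; as a rough Python analogue of the underlying model, the sketch below fits a unit-level linear mixed model with an area random intercept to synthetic data using statsmodels (the paper's models additionally include time random effects).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic unit-level data: units nested in small areas, with an
# area-specific random intercept on top of a common slope.
rng = np.random.default_rng(0)
n_areas, n_units = 30, 40
area = np.repeat(np.arange(n_areas), n_units)
x = rng.normal(size=n_areas * n_units)
area_effect = rng.normal(0.0, 0.5, size=n_areas)[area]
y = 2.0 + 1.5 * x + area_effect + rng.normal(0.0, 1.0, size=x.size)
data = pd.DataFrame({"y": y, "x": x, "area": area})

# Unit-level linear mixed model with an area random intercept.
model = smf.mixedlm("y ~ x", data, groups=data["area"]).fit()
print(model.summary())

# Crude small-area summaries: average the fitted values (fixed part plus
# predicted random effects, as statsmodels computes them) within each area.
data["pred"] = model.fittedvalues
print(data.groupby("area")["pred"].mean().head())
```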


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 1006 ◽  
Author(s):  
N. Tessa Pierce ◽  
Luiz Irber ◽  
Taylor Reiter ◽  
Phillip Brooks ◽  
C. Titus Brown

The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.


2008 ◽  
Vol 08 (02) ◽  
pp. 243-263 ◽  
Author(s):  
BENJAMIN A. AHLBORN ◽  
OLIVER KREYLOS ◽  
SOHAIL SHAFII ◽  
BERND HAMANN ◽  
OLIVER G. STAADT

We introduce a system that adds a foveal inset to large-scale projection displays. The effective resolution of the foveal inset projection is higher than the original display resolution, allowing the user to see more details and finer features in large data sets. The foveal inset is generated by projecting a high-resolution image onto a mirror mounted on a pan-tilt unit that is controlled by the user with a laser pointer. Our implementation is based on Chromium and supports many OpenGL applications without modifications. We present experimental results using high-resolution image data from medical imaging and aerial photography.
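As a back-of-the-envelope illustration of why the inset raises effective resolution (with hypothetical projector and wall dimensions, not figures from the paper), the snippet below compares pixel density over the full display with pixel density when the same projector is steered onto a small inset region.

```python
# Hypothetical numbers: a 1024x768 projector spread over a 4 m x 3 m wall
# versus the same projector steered onto a 0.5 m x 0.375 m inset region.
wall_w_m = 4.0
inset_w_m = 0.5
proj_w_px = 1024

base_ppm = proj_w_px / wall_w_m     # pixels per metre, full wall
inset_ppm = proj_w_px / inset_w_m   # pixels per metre inside the inset
print(f"full-wall density : {base_ppm:.0f} px/m")
print(f"inset density     : {inset_ppm:.0f} px/m ({inset_ppm / base_ppm:.0f}x finer)")
```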


2013 ◽  
Vol 7 (1) ◽  
pp. 19-24
Author(s):  
Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. At times, when the end goal is to look for relationships between (or patterns within) different subgroups or even just individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. As an example, in anthropological microarray studies, such ‘dimension reduction’ techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the larger data set. For example, polling results taken during elections are used to infer the opinions of the population at large. However, what is the best and easiest method of capturing a subset of variation in a data set that can represent the overall portrait of variation? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data sets. New methods of analysis using PCA are also suggested, with tentative results outlined.
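As a minimal dimension-reduction sketch (synthetic genotype-like data and scikit-learn, not the article's own pipeline), the example below projects samples from two hypothetical populations onto their leading principal components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a samples x markers matrix: 60 samples from two
# hypothetical populations that differ slightly across 1,000 markers.
rng = np.random.default_rng(42)
pop_a = rng.binomial(2, 0.30, size=(30, 1000))
pop_b = rng.binomial(2, 0.45, size=(30, 1000))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

# Standardize markers, then project onto the leading principal components.
scaled = StandardScaler().fit_transform(genotypes)
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

# The first few components capture most of the between-group structure,
# which is what makes PCA useful as a dimension-reduction filter.
print("explained variance ratio:", pca.explained_variance_ratio_)
print("PC1 means, pop A vs pop B:",
      coords[:30, 0].mean().round(2), coords[30:, 0].mean().round(2))
```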


2017 ◽  
Vol 26 (2) ◽  
pp. 163-175 ◽  
Author(s):  
Steve Bruce ◽  
Tony Glendinning

As disadvantage can have causes other than discrimination, its presence cannot prove discrimination. However, the absence of patterns of disadvantage in large data sets would be very strong evidence against the presence of sectarian discrimination. In this paper we analyse data on religion, social class, education, gender and region from the 2011 Scottish census. Against those who argue that sectarianism is endemic in the west of Scotland, we find no sectarian association between religion and social class among people at the peak age of their labour market involvement. The class profiles of people in the Other Religion categories are unusual but the profile for Catholics is pretty much the same as for Other Christians. That this analysis involves 487,694 people gives us confidence that the results are robust. Hence we conclude there is no evidence that the Scottish labour market is characterised by sectarian discrimination. We would like to acknowledge the assistance of the staff of the office of the Registrar General for Scotland who kindly provided us with the census data.

