AlbaTraDIS: Comparative analysis of large datasets from parallel transposon mutagenesis experiments

2019 ◽  
Author(s):  
Andrew J. Page ◽  
Sarah Bastkowski ◽  
Muhammad Yasir ◽  
A. Keith Turner ◽  
Thanh Le Viet ◽  
...  

Abstract. Background: Bacteria have evolved over billions of years to survive in a wide range of environments. Currently, there is an incomplete understanding of the genetic basis for mechanisms underpinning survival in stressful conditions, such as the presence of anti-microbials. Transposon mutagenesis has been proven to be a powerful tool to identify genes and networks which are involved in survival and fitness under a given condition by simultaneously assaying the fitness of millions of mutants, thereby relating genotype to phenotype and contributing to an understanding of bacterial cell biology. A recent refinement of this approach allows the roles of essential genes in conditional stress survival to be inferred by altering their expression. These advancements, combined with the rapidly falling costs of sequencing, now allow comparisons between multiple experiments to identify commonalities in stress responses to different conditions. This capacity, however, poses a new challenge for the analysis of multiple data sets in conjunction. Results: To address this analysis need, we have developed ‘AlbaTraDIS’, a software application for rapid large-scale comparative analysis of TraDIS experiments that predicts the impact of transposon insertions on nearby genes. AlbaTraDIS can identify genes which are up- or down-regulated, or inactivated, between multiple conditions, producing a filtered list of genes for further experimental validation as well as several accompanying data visualisations. We demonstrate the utility of our new approach by applying it to identify genes used by Escherichia coli to survive in a wide range of different concentrations of the biocide Triclosan. AlbaTraDIS automatically identified all well-characterised Triclosan resistance genes, including the primary target, fabI. A number of new loci were also implicated in Triclosan resistance, and the predicted phenotypes for a selection of these were validated experimentally, with results showing high consistency with predictions. Conclusions: AlbaTraDIS provides a simple and rapid method to analyse multiple transposon mutagenesis data sets, allowing this technology to be used at large scale. To our knowledge this is the only tool currently available that can perform these tasks. AlbaTraDIS is written in Python 3 and is available under the open source licence GNU GPL 3 from https://github.com/quadram-institute-bioscience/albatradis.
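As a rough illustration of the per-gene comparison described above (this is not AlbaTraDIS code; the function, thresholds and input format are invented for the example), one can classify genes by the log2 fold change of their normalised transposon insertion reads between a condition and a control:

```python
# Illustrative sketch only: classify genes as up, down or unchanged by
# comparing transposon insertion read counts between condition and control.
from math import log2

def classify_genes(control_counts, condition_counts, lfc_cutoff=2.0, pseudocount=1.0):
    """Return {gene: 'up' | 'down' | 'unchanged'} from reads-per-gene dicts."""
    calls = {}
    for gene in control_counts:
        ctrl = control_counts[gene] + pseudocount
        cond = condition_counts.get(gene, 0) + pseudocount
        lfc = log2(cond / ctrl)      # log2 fold change of insertion reads
        if lfc >= lfc_cutoff:
            calls[gene] = "up"       # insertions tolerated or selected for
        elif lfc <= -lfc_cutoff:
            calls[gene] = "down"     # insertions selected against
        else:
            calls[gene] = "unchanged"
    return calls

if __name__ == "__main__":
    control = {"fabI": 1500, "acrB": 900}       # toy read counts
    triclosan = {"fabI": 40, "acrB": 7200}
    print(classify_genes(control, triclosan))   # {'fabI': 'down', 'acrB': 'up'}
```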

2016 ◽  
Author(s):  
Alan Medlar ◽  
Laura Laakso ◽  
Andreia Miraldo ◽  
Ari Löytynoja

Abstract. High-throughput RNA-seq data has become ubiquitous in the study of non-model organisms, but its use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data has to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires the removal of redundant isoforms, assignment of orthologs and conversion of fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton’s performance across a wide range of divergence times between study and reference species. We demonstrate the impact that the choice of assembler has on both the number of alignments and the correctness of ortholog assignment, and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full-length gene sequences, even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.
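As a loose illustration of the redundancy-removal and gene-assignment problem the abstract describes (this is not Glutton’s phylogeny-aware algorithm; the k-mer heuristic and all names are invented for the example):

```python
# Naive sketch: assign each assembled contig to the reference gene it shares
# the most k-mers with, then keep the longest contig per gene.
def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_contig_per_gene(contigs, reference_genes, k=8):
    """contigs, reference_genes: dicts of name -> nucleotide sequence."""
    best = {}  # gene -> (contig_name, contig_length)
    for cname, cseq in contigs.items():
        ckmers = kmers(cseq, k)
        gene = max(reference_genes,
                   key=lambda g: len(ckmers & kmers(reference_genes[g], k)))
        if gene not in best or len(cseq) > best[gene][1]:
            best[gene] = (cname, len(cseq))
    return {g: name for g, (name, _) in best.items()}
```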


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases and assumptions regarding the characteristics of the surface convective parcel result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles that are used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.
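To illustrate why humidity bias corrections propagate into derived convective levels, the sketch below uses the common approximation that the lifting condensation level sits roughly 125 m above ground per degree Celsius of surface dewpoint depression; the numbers are invented and this is not the campaign's processing code.

```python
# Illustration only: a dry-biased dewpoint shifts the derived cloud-base level.
def lcl_height_m(t_surface_c, td_surface_c):
    """Approximate LCL height above ground (m) from surface T and dewpoint (degC)."""
    return 125.0 * (t_surface_c - td_surface_c)

t = 30.0            # surface temperature, degC
td_raw = 18.0       # dewpoint from a dry-biased radiosonde, degC
td_corrected = 19.5 # dewpoint after humidity bias correction, degC

print(lcl_height_m(t, td_raw))        # 1500.0 m
print(lcl_height_m(t, td_corrected))  # 1312.5 m -> cloud base ~190 m lower
```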


Radiocarbon ◽  
2012 ◽  
Vol 54 (3-4) ◽  
pp. 449-474 ◽  
Author(s):  
Sturt W Manning ◽  
Bernd Kromer

The debate over the dating of the Santorini (Thera) volcanic eruption has seen sustained efforts to criticize or challenge the radiocarbon dating of this time horizon. We consider some of the relevant areas of possible movement in the 14C dating—and, in particular, any plausible mechanisms to support as late (most recent) a date as possible. First, we report and analyze data investigating the scale of apparent possible 14C offsets (growing season related) in the Aegean-Anatolia-east Mediterranean region (excluding the southern Levant and especially pre-modern, pre-dam Egypt, which is a distinct case), and find no evidence for more than very small possible offsets from several cases. This topic is thus not an explanation for current differences in dating in the Aegean and at best provides only a few years of latitude. Second, we consider some aspects of the accuracy and precision of 14C dating with respect to the Santorini case. While the existing data appear robust, we nonetheless speculate that examination of the frequency distribution of the 14C data on short-lived samples from the volcanic destruction level at Akrotiri on Santorini (Thera) may indicate that the average value of the overall data sets is not necessarily the most appropriate 14C age to use for dating this time horizon. We note the recent paper of Soter (2011), which suggests that in such a volcanic context some (small) age increment may be possible from diffuse CO2 emissions (the effect is hypothetical at this stage and has not been observed in the field), and that "if short-lived samples from the same stratigraphic horizon yield a wide range of 14C ages, the lower values may be the least altered by old CO2." In this context, it might be argued that a substantive “low” grouping of 14C ages observable within the overall 14C data sets on short-lived samples from the Thera volcanic destruction level, centered about 3326–3328 BP, is perhaps more representative of the contemporary atmospheric 14C age (without any volcanic CO2 contamination). This is a subjective argument (since, in statistical terms, the existing studies using the weighted average remain valid) that looks to support as late a date as reasonable from the 14C data. The impact of employing this revised 14C age is discussed. In general, a late 17th century BC date range is found (to remain) to be most likely even if such a late-dating strategy is followed—a late 17th century BC date range is thus a robust finding from the 14C evidence even allowing for various possible variation factors. However, the possibility of a mid-16th century BC date (within ∼1593–1530 cal BC) is increased when compared against previous analyses if the Santorini data are considered in isolation.
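For reference, the "weighted average" referred to above is the standard error-weighted mean of the individual 14C determinations; in its textbook form, for n conventional 14C ages t_i with 1-sigma errors sigma_i pooled from a single horizon:

```latex
\bar{t} \;=\; \frac{\sum_{i=1}^{n} t_i / \sigma_i^{2}}{\sum_{i=1}^{n} 1 / \sigma_i^{2}},
\qquad
\sigma_{\bar{t}} \;=\; \left( \sum_{i=1}^{n} \frac{1}{\sigma_i^{2}} \right)^{-1/2}
```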


2018 ◽  
Vol 28 (4) ◽  
pp. 372-390 ◽  
Author(s):  
Susanne Karstedt

The reentry of sentenced perpetrators of atrocity crimes is part and parcel of the pursuit of international and transitional justice. As men and women sentenced for war crimes, crimes against humanity and genocide by the International Criminal Tribunal for the former Yugoslavia (ICTY) and the other tribunals return from prisons into society and communities, questions arise as to the impact their reentry has on deeply divided postconflict societies, in particular on victim groups. Contemporary international tribunals and courts mostly do not have penal or correctional policies of their own, and the legacy of early release, commuting of sentences and amnesties that Nuremberg and other post-World War II tribunals have left is a particularly problematic one. Germany’s historical experience provides an analytic blueprint for understanding in which ways contemporary perpetrators return into changed and still fragile societies. This comparative analysis between Nuremberg and the ICTY is based on two data sets including information on returning war criminals sentenced in both tribunals. The comparative analysis focuses on four themes: politics of reentry, admission of guilt and justification, memoirs, and political activism.


2017 ◽  
Author(s):  
Ross Mounce

In this thesis I attempt to gather together a wide range of cladistic analyses of fossil and extant taxa representing a diverse array of phylogenetic groups. I use these data to quantitatively compare the effect of fossil taxa relative to extant taxa in terms of support for relationships, number of most parsimonious trees (MPTs) and leaf stability. In line with previous studies I find that the effects of fossil taxa are seldom different to those of extant taxa – although I highlight some interesting exceptions. I also use these data to compare the phylogenetic signal within vertebrate morphological data sets, by comparing cranial data to postcranial data. Comparisons between molecular data and morphological data have been previously well explored, as have signals between different molecular loci. But comparative signal within morphological data sets is much less commonly characterized, and certainly not across a wide array of clades. With this analysis I show that there are many studies in which the evidence provided by cranial data appears to be significantly incongruent with the postcranial data – more than one would expect to see by the effect of chance and noise alone. I devise and implement a modification to a rarely used measure of homoplasy that will hopefully encourage its wider usage. Previously it had an undesirable bias associated with the distribution of missing data in a dataset, but my modification controls for this. I also undertake an in-depth and extensive review of the ILD test, noting it is often misused or reported poorly, even in recent studies. Finally, in attempting to collect data and metadata on a large scale, I uncovered inefficiencies in the research publication system that obstruct re-use of data and scientific progress. I highlight the importance of replication and reproducibility – even simple reanalysis of high-profile papers can turn up some very different results. Data are highly valuable and thus must be retained and made available for further re-use to maximize the overall return on research investment.
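For orientation: the abstract does not name the "rarely used measure of homoplasy" that is modified, so it is not reproduced here; the standard homoplasy-related indices for a character matrix on a tree (where m is the minimum conceivable number of steps, s the observed steps on the tree, and g the maximum conceivable number of steps) are:

```latex
\mathrm{CI} = \frac{m}{s}, \qquad
\mathrm{HI} = 1 - \mathrm{CI}, \qquad
\mathrm{RI} = \frac{g - s}{g - m}
```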


2021 ◽  
Author(s):  
Andrew J Kavran ◽  
Aaron Clauset

Abstract. Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy by up to 43% compared to using unfiltered data. Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters for applications in systems biology.
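A minimal sketch of the filtering idea, assuming a precomputed signed interaction network (the published method's weighting scheme and module decomposition are not reproduced here):

```python
# Sketch: blend each node's noisy measurement with the signed mean of its
# neighbours, so anti-correlated neighbours contribute with flipped sign.
import numpy as np

def network_filter(values, signed_adj, alpha=0.5):
    """values: (n,) noisy measurements; signed_adj: (n, n) with +1/-1/0 entries;
    alpha: how strongly to pull each value towards its neighbourhood estimate."""
    degree = np.abs(signed_adj).sum(axis=1)
    degree[degree == 0] = 1.0                       # isolated nodes keep their value
    neighbour_est = (signed_adj @ values) / degree  # signed neighbourhood average
    return (1 - alpha) * values + alpha * neighbour_est

if __name__ == "__main__":
    adj = np.array([[0, 1, -1], [1, 0, 0], [-1, 0, 0]], dtype=float)
    noisy = np.array([1.2, 0.9, -1.1])
    print(network_filter(noisy, adj))
```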


Author(s):  
Sacha J. van Albada ◽  
Jari Pronold ◽  
Alexander van Meegen ◽  
Markus Diesmann

Abstract. We are entering an age of ‘big’ computational neuroscience, in which neural network models are increasing in size and in numbers of underlying data sets. Consolidating the zoo of models into large-scale models simultaneously consistent with a wide range of data is only possible through the effort of large teams, which can be spread across multiple research institutions. To ensure that computational neuroscientists can build on each other’s work, it is important to make models publicly available as well-documented code. This chapter describes such an open-source model, which relates the connectivity structure of all vision-related cortical areas of the macaque monkey with their resting-state dynamics. We give a brief overview of how to use the executable model specification, which employs NEST as simulation engine, and show its runtime scaling. The solutions found serve as an example for organizing the workflow of future models from the raw experimental data to the visualization of the results, expose the challenges, and give guidance for the construction of an ICT infrastructure for neuroscience.
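The executable model specification itself is not reproduced here; the sketch below only shows the general PyNEST pattern of creating populations and specifying within- and between-area connectivity, with placeholder neuron models and parameters:

```python
# Minimal PyNEST pattern (not the multi-area model): two toy "areas" with
# local and long-range connections, driven by Poisson input.
import nest

nest.ResetKernel()

area_a = nest.Create("iaf_psc_alpha", 100)
area_b = nest.Create("iaf_psc_alpha", 100)
drive = nest.Create("poisson_generator", params={"rate": 8000.0})

local_conn = {"rule": "fixed_indegree", "indegree": 10}   # within-area
long_range = {"rule": "fixed_indegree", "indegree": 2}    # between-area
nest.Connect(area_a, area_a, local_conn, {"weight": 20.0, "delay": 1.5})
nest.Connect(area_b, area_b, local_conn, {"weight": 20.0, "delay": 1.5})
nest.Connect(area_a, area_b, long_range, {"weight": 10.0, "delay": 3.0})
nest.Connect(area_b, area_a, long_range, {"weight": 10.0, "delay": 3.0})
nest.Connect(drive, area_a + area_b)

nest.Simulate(1000.0)  # ms
```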


2009 ◽  
Vol 2 (1) ◽  
pp. 87-98 ◽  
Author(s):  
C. Lerot ◽  
M. Van Roozendael ◽  
J. van Geffen ◽  
J. van Gent ◽  
C. Fayt ◽  
...  

Abstract. Total O3 columns have been retrieved from six years of SCIAMACHY nadir UV radiance measurements using SDOAS, an adaptation of the GDOAS algorithm previously developed at BIRA-IASB for the GOME instrument. GDOAS and SDOAS have been implemented by the German Aerospace Center (DLR) in version 4 of the GOME Data Processor (GDP) and in version 3 of the SCIAMACHY Ground Processor (SGP), respectively. The processors are run at the DLR processing centre on behalf of the European Space Agency (ESA). We first focus on the description of the SDOAS algorithm, with particular attention to the impact of uncertainties in the reference O3 absorption cross-sections. Second, the resulting SCIAMACHY total ozone data set is globally evaluated through large-scale comparisons with results from GOME and OMI as well as with ground-based correlative measurements. The various total ozone data sets are found to agree within 2% on average. However, a negative trend of 0.2–0.4%/year has been identified in the SCIAMACHY O3 columns; this probably originates from instrumental degradation effects that have not yet been fully characterized.
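For context, DOAS-type retrievals such as GDOAS/SDOAS rest on fitting the measured optical depth with absorber cross-sections and a low-order closure polynomial; in its generic textbook form (the operational algorithm's specifics are in the paper), with cross-sections sigma_i, slant column densities S_i and polynomial coefficients a_k:

```latex
\ln\frac{I_0(\lambda)}{I(\lambda)}
  \;=\; \sum_i \sigma_i(\lambda)\, S_i \;+\; \sum_{k} a_k \lambda^{k}
```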


2021 ◽  
Author(s):  
Theresa A Harbig ◽  
Sabrina Nusrat ◽  
Tali Mazor ◽  
Qianwen Wang ◽  
Alexander Thomson ◽  
...  

Molecular profiling of patient tumors and liquid biopsies over time with next-generation sequencing technologies and new immuno-profile assays is becoming part of standard research and clinical practice. With the wealth of new longitudinal data, there is a critical need for visualizations that let cancer researchers explore and interpret temporal patterns not just in a single patient but across cohorts. To address this need, we developed OncoThreads, a tool for the visualization of longitudinal clinical, cancer genomics and other molecular data in patient cohorts. The tool visualizes patient cohorts as temporal heatmaps and Sankey diagrams that support the interactive exploration and ranking of a wide range of clinical and molecular features. This allows analysts to discover temporal patterns in longitudinal data, such as the impact of mutations on response to a treatment, e.g. the emergence of resistant clones. We demonstrate the functionality of OncoThreads using a cohort of 23 glioma patients sampled at 2–4 timepoints. OncoThreads is freely available at http://oncothreads.gehlenborglab.org and implemented in Javascript using the cBioPortal web API as a backend.
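As a minimal illustration of the data shape behind a temporal heatmap (toy values; not OncoThreads code, which is written in Javascript against the cBioPortal API):

```python
# One row per patient per timepoint, pivoted into a patients x timepoints matrix.
import pandas as pd

records = [
    {"patient": "P1", "timepoint": 1, "feature": "tumor_size_cm", "value": 4.1},
    {"patient": "P1", "timepoint": 2, "feature": "tumor_size_cm", "value": 2.0},
    {"patient": "P2", "timepoint": 1, "feature": "tumor_size_cm", "value": 3.3},
    {"patient": "P2", "timepoint": 2, "feature": "tumor_size_cm", "value": 3.9},
]

df = pd.DataFrame(records)
heatmap = df.pivot(index="patient", columns="timepoint", values="value")
print(heatmap)  # rows = patients, columns = timepoints -> ready for plotting
```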


2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
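As a toy example of computing simple structural metrics over RDF data (the paper's specific redundancy and pattern metrics are not reproduced here), one might use rdflib:

```python
# Count triples, per-subject out-degree and predicate usage in a small graph.
from collections import Counter
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob, ex:carol .
ex:bob   ex:knows ex:carol .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

subject_outdegree = Counter(s for s, _, _ in g)   # triples per subject
predicate_usage = Counter(p for _, p, _ in g)     # occurrences of each predicate

print(len(g), "triples")
print(subject_outdegree.most_common())
print(predicate_usage.most_common())
```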

