Genomic variants among threatened Acropora corals

Abstract: Genomic sequence data for non-model organisms are increasingly available, requiring the development of efficient and reproducible workflows. Here, we develop the first genomic resources and reproducible workflows for two threatened members of the reef-building coral genus Acropora. We generated genomic sequence data from multiple samples of the Caribbean A. cervicornis (staghorn coral) and A. palmata (elkhorn coral), and predicted millions of nucleotide variants among these two species and the Pacific A. digitifera. A subset of predicted nucleotide variants was verified using restriction length polymorphism assays and proved useful in distinguishing the two Caribbean Acroporids and the hybrid they form (“A. prolifera”). Nucleotide variants are freely available from the Galaxy server (usegalaxy.org), and can be analyzed there with computational tools and stored workflows that require only an internet browser. We describe these data and some of the analysis tools, concentrating on fixed differences between A. cervicornis and A. palmata. In particular, we found that fixed amino acid differences between these two species were enriched in proteins associated with development, cellular stress response and the host’s interactions with associated microbes, for instance in the Wnt pathway, ABC transporters and superoxide dismutase. Identified candidate genes may underlie functional differences in the way these threatened species respond to changing environments. Users can expand the presented analyses easily by adding genomic data from additional species as they become available.

Article Summary: We provide the first comprehensive genomic resources for two threatened Caribbean reef-building corals in the genus Acropora. We identified genetic differences in key pathways and genes known to be important in the animals’ response to environmental disturbances and in larval development. We further provide a list of candidate loci for large-scale genotyping of these species to gather intra- and interspecies differences between A. cervicornis and A. palmata across their geographic range. All analyses and workflows are made available and can be used as a resource to analyze not only these corals but also other non-model organisms.
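The notion of a fixed difference between A. cervicornis and A. palmata can be illustrated with a small sketch. It is not the authors' Galaxy workflow. It assumes a hypothetical input layout (site ID mapped to a list of diploid genotype strings per species) and treats a site as fixed when the two species share no allele among the sampled individuals. The function name and the example data are invented for illustration.

```python
from typing import Dict, List


def fixed_differences(
    group_a: Dict[str, List[str]],
    group_b: Dict[str, List[str]],
) -> List[str]:
    """Return IDs of sites where the two groups share no allele.

    Genotypes are assumed to be strings such as "AA" or "CT"
    (one character per allele); this layout is hypothetical.
    """
    fixed = []
    # Only sites genotyped in both groups can be compared.
    for site in sorted(group_a.keys() & group_b.keys()):
        alleles_a = set("".join(group_a[site]))
        alleles_b = set("".join(group_b[site]))
        # Fixed difference: both groups have data and no allele is shared.
        if alleles_a and alleles_b and alleles_a.isdisjoint(alleles_b):
            fixed.append(site)
    return fixed


# Invented example genotypes, two samples per species.
cervicornis = {"site1": ["AA", "AA"], "site2": ["CT", "CC"], "site3": ["GG", "GG"]}
palmata = {"site1": ["GG", "GG"], "site2": ["CC", "CC"], "site3": ["GG", "GG"]}

fixed_sites = fixed_differences(cervicornis, palmata)
print(fixed_sites)
```

With only a few samples per species, a site that looks fixed may still be polymorphic in the wider population, which is one reason the candidate loci are proposed for genotyping across the geographic range.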


Large‐scale genomic sequence data resolve the deepest divergences in the legume phylogeny and support a near‐simultaneous evolutionary origin of all six subfamilies

New Phytologist ◽

10.1111/nph.16290 ◽

2019 ◽

Vol 225 (3) ◽

pp. 1355-1369 ◽

Cited By ~ 12

Author(s):

Erik J. M. Koenen ◽

Dario I. Ojeda ◽

Royce Steeves ◽

Jérémy Migliore ◽

Freek T. Bakker ◽

...

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Evolutionary Origin


Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library) which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


The Origin and Early Evolution of the Legumes are a Complex Paleopolyploid Phylogenomic Tangle closely associated with the Cretaceous-Paleogene (K-Pg) Boundary

10.1101/577957 ◽

2019 ◽

Cited By ~ 3

Author(s):

Erik J.M. Koenen ◽

Dario I. Ojeda ◽

Royce Steeves ◽

Jérémy Migliore ◽

Freek T. Bakker ◽

...

Keyword(s):

Mass Extinction ◽

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Incomplete Lineage Sorting ◽

Nuclear Gene ◽

Early Evolution ◽

Large Set ◽

Gene Trees ◽

The Family

Abstract: The consequences of the Cretaceous-Paleogene (K-Pg) boundary (KPB) mass extinction for the evolution of plant diversity are poorly understood, even though evolutionary turnover of plant lineages at the KPB is central to understanding the assembly of the Cenozoic biota. One aspect that has received considerable attention is the apparent concentration of whole genome duplication (WGD) events around the KPB, which may have played a role in survival and subsequent diversification of plant lineages. In order to gain new insights into the origins of Cenozoic biodiversity, we examine the origin and early evolution of the legume family, one of the most important angiosperm clades that rose to prominence after the KPB and for which multiple WGD events are found to have occurred early in its evolution. The legume family (Leguminosae or Fabaceae), with c. 20,000 species, is the third largest family of Angiospermae, and is globally widespread and second only to the grasses (Poaceae) in economic importance. Accordingly, it has been intensively studied in botanical, systematic and agronomic research, but a robust phylogenetic framework and timescale for legume evolution based on large-scale genomic sequence data is lacking, and key questions about the origin and early evolution of the family remain unresolved. We extend previous phylogenetic knowledge to gain insights into the early evolution of the family, analysing an alignment of 72 protein-coding chloroplast genes and a large set of nuclear genomic sequence data, sampling thousands of genes. We use a concatenation approach with heterogeneous models of sequence evolution to minimize inference artefacts, and evaluate support and conflict among individual nuclear gene trees with internode certainty calculations, a multi-species coalescent method, and phylogenetic supernetwork reconstruction. Using a set of 20 fossil calibrations we estimate a revised timeline of legume evolution based on a selection of genes that are both informative and evolving in an approximately clock-like fashion. We find that the root of the family is particularly difficult to resolve, with strong conflict among gene trees suggesting incomplete lineage sorting and/or reticulation. Mapping of duplications in gene family trees suggests that a WGD event occurred along the stem of the family and is shared by all legumes, with additional nested WGDs subtending subfamilies Papilionoideae and Detarioideae. We propose that the difficulty of resolving the root of the family is caused by a combination of ancient polyploidy and an alternation of long and very short internodes, shaped respectively by extinction and rapid divergence. Our results show that the crown age of the legumes dates back to the Maastrichtian or Paleocene and suggest that it is most likely close to the KPB. We conclude that the origin and early evolution of the legumes followed a complex history, in which multiple nested polyploidy events coupled with rapid diversification are associated with the mass extinction event at the KPB, ultimately underpinning the evolutionary success of the Leguminosae in the Cenozoic.


Glutton: large-scale integration of non-model organism transcriptome data for comparative analysis

10.1101/077511 ◽

2016 ◽

Cited By ~ 2

Author(s):

Alan Medlar ◽

Laura Laakso ◽

Andreia Miraldo ◽

Ari Löytynoja

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Model Organism ◽

Model Organisms ◽

Rna Seq ◽

Reference Species ◽

Wide Range ◽

The Impact

Abstract: High-throughput RNA-seq data has become ubiquitous in the study of non-model organisms, but its use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data has to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires the removal of redundant isoforms, assignment of orthologs and converting fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton’s performance across a wide range of divergence times between study and reference species. We demonstrate the impact choice of assembler has on both the number of alignments and the correctness of ortholog assignment and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full length gene sequences even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.


Development of Self-Compressing BLSOM for Comprehensive Analysis of Big Sequence Data

BioMed Research International ◽

10.1155/2015/506052 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8 ◽

Cited By ~ 2

Author(s):

Akihito Kikuchi ◽

Toshimichi Ikemura ◽

Takashi Abe

Keyword(s):

High Performance ◽

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Bacterial Genome ◽

Computation Time ◽

Comprehensive Analysis ◽

Self Organizing Map ◽

Genome Sequences ◽

Oligonucleotide Composition

With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of available big sequence data. We previously developed a Batch-Learning Self-Organizing Map (BLSOM), which clusters genomic fragment sequences according to phylotype based solely on oligonucleotide composition, and applied it to genome and metagenomic studies. BLSOM is suitable for high-performance parallel computing and can analyze big data simultaneously, but a large-scale BLSOM needs a large computational resource. We have developed Self-Compressing BLSOM (SC-BLSOM) to reduce computation time, which allows us to carry out comprehensive analysis of big sequence data without the use of high-performance supercomputers. The strategy of SC-BLSOM is to construct BLSOMs hierarchically according to data class, such as phylotype. Each first-layer BLSOM was constructed from one of the divided input data pieces, each representing a data subclass such as a phylotype division, which compresses the number of data pieces. The second-layer BLSOM was constructed from all of the weight vectors obtained in the first-layer BLSOMs. We compared SC-BLSOM with the conventional BLSOM by analyzing bacterial genome sequences. SC-BLSOM could be constructed faster than BLSOM and clustered the sequences according to phylotype with high accuracy, showing the method’s suitability for efficient knowledge discovery from big sequence data.
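The oligonucleotide composition used as BLSOM input can be pictured as a fixed-length frequency vector per sequence fragment. The sketch below is a minimal illustration, not the BLSOM or SC-BLSOM implementation. It assumes tetranucleotides (k = 4) as the default, skips windows containing ambiguous bases, and does not merge reverse-complement pairs. The function name is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict


def oligo_composition(seq: str, k: int = 4) -> Dict[str, float]:
    """Return relative frequencies of all 4**k oligonucleotides in seq.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored, so the frequencies sum to 1 over valid windows only.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Dinucleotide example on a short invented fragment.
comp = oligo_composition("ACGTACGT", k=2)
print(len(comp), comp["AC"])
```

Vectors of this kind, one per genomic fragment, are what a self-organizing map would then cluster. The map training itself is outside the scope of this sketch.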


Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

10.1101/2020.01.26.920173 ◽

2020 ◽

Cited By ~ 1

Author(s):

Martin Steinegger ◽

Steven L Salzberg

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Model Organism ◽

Taxonomic Composition ◽

Reference Sequence ◽

Metagenomic Sequencing ◽

Protein Database ◽

Input Size ◽

Reference Databases

Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator


An Integrated Pipeline of Open Source Software Adapted for Multi-CPU Architectures: Use in the Large-Scale Identification of Single Nucleotide Polymorphisms

Comparative and Functional Genomics ◽

10.1155/2007/35604 ◽

2007 ◽

Vol 2007 ◽

pp. 1-7 ◽

Cited By ~ 1

Author(s):

B. Jayashree ◽

Manindra S. Hanspal ◽

Rajgopal Srinivasan ◽

R. Vigneshwaran ◽

Rajeev K. Varshney ◽

...

Keyword(s):

Single Nucleotide Polymorphisms ◽

Open Source ◽

Open Source Software ◽

Large Scale ◽

Sequence Data ◽

Snp Genotyping ◽

Model Organisms ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Web Interfaces

The large amounts of EST sequence data available from a single species of an organism as well as for several species within a genus provide an easy source of identification of intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. There are several available bioinformatics tools that can be used to mine this data; however, using them requires a certain level of expertise: the tools have to be used sequentially with accompanying format conversion and steps like clustering and assembly of sequences become time-intensive jobs even for moderately sized datasets. We report here a pipeline of open source software extended to run on multiple CPU architectures that can be used to mine large EST datasets for SNPs and identify restriction sites for assaying the SNPs so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the developed pipeline by mining chickpea ESTs for interspecies SNPs, development of CAPS assays for SNP genotyping, and confirmation of restriction digestion pattern at the sequence level.
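The step of identifying restriction sites for assaying SNPs can be sketched as follows. This is an illustrative simplification, not the ICRISAT pipeline. It assumes two allele sequences flanking a SNP and a small hand-picked enzyme table, and reports enzymes whose recognition site is present in one allele but absent from the other. The function name and example sequences are invented. The three recognition sequences are the standard palindromic sites for these enzymes, so only one strand needs to be scanned.

```python
from typing import Dict, List, Tuple

# Small illustrative enzyme table (recognition sequences, 5'->3').
ENZYMES: Dict[str, str] = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def caps_candidates(
    allele_a: str,
    allele_b: str,
    enzymes: Dict[str, str] = ENZYMES,
) -> List[Tuple[str, str]]:
    """Return (enzyme, allele) pairs where only that allele carries the site."""
    hits = []
    a, b = allele_a.upper(), allele_b.upper()
    for name, site in sorted(enzymes.items()):
        in_a = site in a
        in_b = site in b
        # A CAPS marker needs the site in exactly one of the two alleles.
        if in_a != in_b:
            hits.append((name, "A" if in_a else "B"))
    return hits


# Invented alleles differing by a single A/G SNP inside an EcoRI site.
allele_a = "TTGACGAATTCAGGT"
allele_b = "TTGACGAGTTCAGGT"
candidates = caps_candidates(allele_a, allele_b)
print(candidates)
```

A fuller treatment would compare the number and positions of sites per allele, since an amplicon may carry additional copies of the same site away from the SNP.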


Application of genotyping-by-sequencing data on inferring the phylogeny of Curcuma (Zingiberaceae) from China

10.21203/rs.2.15210/v1 ◽

2019 ◽

Author(s):

Heng Liang ◽

Yan Zhang ◽

Jiabing Deng ◽

Gang Gao ◽

Chunbang Ding ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Large Scale ◽

Reference Genome ◽

Genomic Sequence ◽

Sequence Data ◽

Morphological Differentiation ◽

Genotyping By Sequencing ◽

Tibet Plateau ◽

Sequencing Data

Abstract: Background: Genotyping-by-sequencing (GBS), a next-generation sequencing approach, has been applied to large-scale genotyping in plants with poor morphological differentiation and low genetic divergence among species. Curcuma is a genus of considerable medicinal and edible importance. Resolving phylogenetic relationships and delimiting species in the genus remain challenging because of poor morphological differentiation and the lack of a reference genome. Results: High-throughput genomic sequence data obtained through GBS protocols were used to investigate the relationships among 8 Curcuma species represented by 60 samples. Using the ipyrad software, 437,061 loci and 997,988 filtered SNPs were produced without reliance on a reference genome. After quality control (QC) of the filtered SNPs, 1,295 high-quality SNPs were used to clarify the phylogenetic relationships among Curcuma species. Based on these data, a supermatrix approach was used to infer the phylogenetic trees and relationships. Conclusions: Varying degrees of support can be explained, as well as the diversification events for Chinese Curcuma. The diversification events suggested that the third intense uplift of the Qinghai–Tibet Plateau (QTP) and the formation of the Hengduan Mountains may have accelerated interspecific divergence of Curcuma in China. The PCA was consistent with the topology of the phylogenetic tree. The genetic structure analysis revealed that extensive hybridization may exist in Chinese Curcuma. Additionally, GBS is a promising approach for future phylogenetic and systematic studies.


Fast and accurate statistical inference of phylogenetic networks using large-scale genomic sequence data

10.1101/132795 ◽

2017 ◽

Cited By ~ 1

Author(s):

Hussein A. Hejase ◽

Natalie VandePol ◽

Gregory M. Bonito ◽

Kevin J. Liu

Keyword(s):

Gene Flow ◽

Large Scale ◽

Genomic Sequence ◽

State Of The Art ◽

Sequence Data ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Performance Study ◽

Art Methods

Abstract: An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organisms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network which also captures “horizontal” gene flow? The answer to this fundamental question not only depends upon densely sampled and divergent genomic sequence data, but also computational methods which are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodological advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involving many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topological accuracy.


ENU Large-scale Mutagenesis and Quantitative Trait Linkage (QTL) Analysis in Mice: Novel Technologies for Searching Polygenetic Determinants of Craniofacial Abnormalities

Critical Reviews in Oral Biology & Medicine ◽

10.1177/154411130301400503 ◽

2003 ◽

Vol 14 (5) ◽

pp. 320-330 ◽

Cited By ~ 10

Author(s):

Ichiro Nishimura ◽

Thomas A. Drake ◽

Aldons J. Lusis ◽

Karen M. Lyons ◽

Joseph H. Nadeau ◽

...

Keyword(s):

Qtl Analysis ◽

Quantitative Trait ◽

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Craniofacial Abnormalities ◽

Novel Technologies ◽

Size And Shape ◽

Quantitative Trait Linkage ◽

Mapping Techniques

Discrepancies in size and shape of the jaws are the underlying etiology in many orthodontic and orthognathic surgery patients. Genetic factors combined with environmental interactions have been postulated to play a causal or contributory role in these craniofacial abnormalities. Along with the soon-to-be-available complete human and mouse genomic sequence data, mouse mutants have become a valuable tool in the functional mapping of genes involved in the development of human maxillofacial dysmorphologies. We review two powerful methods in such efforts: N-ethyl-N-nitrosourea (ENU) large-scale mutagenesis and quantitative trait linkage (QTL) analysis. The former aims at producing a plethora of novel variants of particular trait(s), and ultimately mapping the point mutations responsible for the appearance of these new traits. In contrast, the latter applies intensive breeding and mapping techniques to identify multiple loci (and, subsequently, genes) contributing to the phenotypic difference between the tested strains. A prerequisite for either approach to studying variations in the traits of interest is the application of effective mouse cephalometric phenotype analysis and rapid DNA mapping techniques. These approaches will produce a wealth of new data on critical genes that influence the size and shape of the human face.
