Exploring the phylogeny of rosids with a five-locus supermatrix from GenBank

Abstract Current advances in sequencing technology have greatly increased the availability of sequence data from public genetic databases. With data from GenBank, we assemble and phylogenetically investigate a 19,740-taxon, five-locus supermatrix (i.e., atpB, rbcL, matK, matR, and ITS) for rosids, a large clade containing over 90,000 species, or approximately a quarter of all angiosperms (assuming an estimate of 400,000 angiosperm species). The topology and divergence times of the five-locus tree generally agree with previous estimates of rosid phylogeny, and we recover greater resolution and support in several areas along the rosid backbone, but with a few significant differences (e.g., the placement of the COM clade, as well as Myrtales, Vitales, and Zygophyllales). Our five-locus phylogeny is the most comprehensive DNA data set yet compiled for the rosid clade. Yet, even with 19,740 species, current sampling represents only 16-22% of all rosids, and we also find evidence of strong phylogenetic bias in the accumulation of GenBank data, highlighting continued challenges for species coverage. These limitations also exist in other major angiosperm clades (e.g., asterids, monocots) as well as other large, understudied branches of the Tree of Life, highlighting the need for broader molecular sampling. Nevertheless, the phylogeny presented here improves upon sampling by more than two-fold and will be an important resource for macroevolutionary studies of this pivotal clade.
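The coverage figures quoted in this abstract can be checked with simple arithmetic. The sketch below uses only the numbers given above (19,740 sampled taxa, "over 90,000" rosid species); the species total corresponding to the 16% lower bound is not stated in the abstract and is back-calculated here purely as an illustration.

```python
# Back-of-the-envelope check of the sampling coverage quoted in the abstract.
SAMPLED_TAXA = 19_740   # taxa in the five-locus supermatrix (from the abstract)
ROSIDS_LOW = 90_000     # "over 90,000 species" (from the abstract)


def coverage(sampled: int, total: int) -> float:
    """Return the share of species sampled, as a percentage."""
    return 100.0 * sampled / total


upper_coverage = coverage(SAMPLED_TAXA, ROSIDS_LOW)

# Assumption: the 16% bound implies a larger species estimate than 90,000.
# This total is back-calculated for illustration, not reported in the abstract.
implied_total_at_16 = SAMPLED_TAXA / 0.16

print(f"coverage with 90,000 species: {upper_coverage:.1f}%")
print(f"species total implied by 16% coverage: {implied_total_at_16:,.0f}")
```

With 90,000 species the coverage comes out at just under 22%, which matches the upper end of the quoted range.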

SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control

mSystems ◽

10.1128/msystems.00202-17 ◽

2018 ◽

Vol 3 (3) ◽

Cited By ~ 15

Author(s):

Gabriel A. Al-Ghalith ◽

Benjamin Hillmann ◽

Kaiwei Ang ◽

Robin Shields-Cutler ◽

Dan Knights

Keyword(s):

Quality Control ◽

Dna Sequences ◽

Sequence Data ◽

Background Knowledge ◽

Sequencing Technology ◽

Data Set ◽

Short Read ◽

Dna Quality ◽

Public Data ◽

User Friendly

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the information about how their data were generated that they need to make informed choices, such as the expected overlap for paired-end sequences or the type of adaptors used. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.).
Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.
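One of the parameters the abstract mentions is the quality score at which reads should be trimmed. As a rough illustration of what such a trimming step does, here is a minimal Python sketch. It is not SHI7's implementation: the function name, the default threshold of 20, and the Phred+33 quality encoding are all assumptions made for the example.

```python
def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Trim bases from the 3' end while their Phred score is below min_q.

    Assumes Phred+33 encoded quality strings (an assumption for this sketch).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(seq)
    # Walk backwards from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_low_quality_tail(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

A pipeline that learns its parameters from the data would choose `min_q` itself rather than rely on a fixed default; the fixed value here only keeps the sketch short.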

The effect of alignment uncertainty, substitution models and priors in building and dating the mammal tree of life

BMC Evolutionary Biology ◽

10.1186/s12862-019-1534-9 ◽

2019 ◽

Vol 19 (1) ◽

Cited By ~ 1

Author(s):

Yan Du ◽

Shaoyuan Wu ◽

Scott V. Edwards ◽

Liang Liu

Keyword(s):

Divergence Time ◽

Time Estimation ◽

Species Tree ◽

Tree Of Life ◽

Divergence Times ◽

Gene Trees ◽

Data Set ◽

Divergence Time Estimation ◽

Substitution Models ◽

Alignment Uncertainty

Abstract Background The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. It is widely appreciated that automated alignment protocols can yield inaccuracies, but the relative impact of various sources of error on phylogenomic analysis is not yet known. This study employs an updated mammal data set of 5162 coding loci sampled from 90 species to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation. Additionally, a novel coalescent likelihood ratio test is introduced for comparing competing species trees against a given set of gene trees. Results The aligned DNA sequences of 5162 loci from 90 species were trimmed and filtered using trimAL and two filtering protocols. The final dataset contains 4 sets of alignments - before trimming, after trimming, filtered by a recently proposed pipeline, and further filtered by comparing ML gene trees for each locus with the concatenation tree. Our analyses suggest that the average discordance among the coalescent trees is significantly smaller than that among the concatenation trees estimated from the 4 sets of alignments or with different substitution models. There is no significant difference among the divergence times estimated with different substitution models. However, the divergence dates estimated from the alignments after trimming are more recent than those estimated from the alignments before trimming. Conclusions Our results highlight that alignment uncertainty of the updated mammal data set and the choice of substitution models have little impact on tree topologies yielded by coalescent methods for species tree estimation, whereas they are more influential on the trees made by concatenation.
Given the choice of calibration scheme and clock models, divergence time estimates are robust to the choice of substitution models, but removing alignments deemed problematic by trimming algorithms can lead to more recent dates. Although the fossil prior is important in divergence time estimation, Bayesian estimates of divergence times in this data set are driven primarily by the sequence data.

Annotated genome sequences of the carnivorous plant Roridula gorgonias and a non-carnivorous relative, Clethra arborea

BMC Research Notes ◽

10.1186/s13104-020-05254-4 ◽

2020 ◽

Vol 13 (1) ◽

Author(s):

Stefanie Hartmann ◽

Michaela Preick ◽

Silke Abelt ◽

André Scheffel ◽

Michael Hofreiter

Keyword(s):

Nuclear Genome ◽

Evolutionary Genetics ◽

Carnivorous Plant ◽

Species Tree ◽

Tree Of Life ◽

Carnivorous Plants ◽

Genome Sequences ◽

Data Set ◽

Important Resource ◽

Nuclear Genomes

Abstract Objective Plant carnivory is distributed across the tree of life and has evolved at least six times independently, but sequenced and annotated nuclear genomes of carnivorous plants are currently lacking. We have sequenced and structurally annotated the nuclear genome of the carnivorous Roridula gorgonias and that of a non-carnivorous relative, Madeira’s lily-of-the-valley-tree, Clethra arborea, both within the Ericales. This data adds an important resource to study the evolutionary genetics of plant carnivory across angiosperm lineages and also for functional and systematic aspects of plants within the Ericales. Results Our assemblies have total lengths of 284 Mbp (R. gorgonias) and 511 Mbp (C. arborea) and show high BUSCO scores of 84.2% and 89.5%, respectively. We used their predicted genes together with publicly available data from other Ericales’ genomes and transcriptomes to assemble a phylogenomic data set for the inference of a species tree. However, groups of orthologs showed a marked absence of species represented by a transcriptome. We discuss possible reasons and caution against combining predicted genes from genome- and transcriptome-based assemblies.

Population Subdivision and Molecular Sequence Variation: Theory and Analysis of Drosophila ananassae Data

Genetics ◽

10.1093/genetics/165.3.1385 ◽

2003 ◽

Vol 165 (3) ◽

pp. 1385-1395

Author(s):

Claus Vogl ◽

Aparup Das ◽

Mark Beaumont ◽

Sujata Mohanty ◽

Wolfgang Stephan

Keyword(s):

Sequence Data ◽

Isolation By Distance ◽

Allele Frequencies ◽

Drosophila Ananassae ◽

Population Subdivision ◽

Variation Theory ◽

Peripheral Populations ◽

Molecular Variation ◽

Data Set ◽

Evolutionary Forces

Abstract Population subdivision complicates analysis of molecular variation. Even if neutrality is assumed, three evolutionary forces need to be considered: migration, mutation, and drift. Simplification can be achieved by assuming that the process of migration among and drift within subpopulations is occurring fast compared to mutation and drift in the entire population. This allows a two-step approach in the analysis: (i) analysis of population subdivision and (ii) analysis of molecular variation in the migrant pool. We model population subdivision using an infinite island model, where we allow the migration/drift parameter Θ to vary among populations. Thus, central and peripheral populations can be differentiated. For inference of Θ, we use a coalescence approach, implemented via a Markov chain Monte Carlo (MCMC) integration method that allows estimation of allele frequencies in the migrant pool. The second step of this approach (analysis of molecular variation in the migrant pool) uses the estimated allele frequencies in the migrant pool for the study of molecular variation. We apply this method to a Drosophila ananassae sequence data set. We find little indication of isolation by distance, but large differences in the migration parameter among populations. The population as a whole seems to be expanding. A population from Bogor (Java, Indonesia) shows the highest variation and seems closest to the species center.

Molecular clocks and the early evolution of metazoan nervous systems

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2015.0046 ◽

2015 ◽

Vol 370 (1684) ◽

pp. 20150046 ◽

Cited By ~ 23

Author(s):

Gregory A. Wray

Keyword(s):

Sensory Processing ◽

Fossil Record ◽

Sequence Data ◽

Divergence Times ◽

Molecular Clocks ◽

Sense Organs ◽

Animal Evolution ◽

Nervous Systems ◽

Nervous System Evolution ◽

Fossil Records

The timing of early animal evolution remains poorly resolved, yet it is critical for understanding nervous system evolution. Methods for estimating divergence times from sequence data have improved considerably, providing a more refined understanding of key divergences. The best molecular estimates point to the origin of metazoans and bilaterians tens to hundreds of millions of years earlier than their first appearances in the fossil record. Both the molecular and fossil records are compatible, however, with the possibility of tiny, unskeletonized, low energy budget animals during the Proterozoic that had planktonic, benthic, or meiofaunal lifestyles. Such animals would likely have had relatively simple nervous systems equipped primarily to detect food, avoid inhospitable environments and locate mates. The appearance of the first macropredators during the Cambrian would have changed the selective landscape dramatically, likely driving the evolution of complex sense organs, sophisticated sensory processing systems, and diverse effector systems involved in capturing prey and avoiding predation.

Frequent Closed Partial Orders Mining in Sequences

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.846-847.1304 ◽

2013 ◽

Vol 846-847 ◽

pp. 1304-1307

Author(s):

Ye Wang ◽

Yan Jia ◽

Lu Min Zhang

Keyword(s):

Sequence Data ◽

Real Data ◽

Partial Orders ◽

Hard Problem ◽

Important Data ◽

Data Set ◽

Pruning Algorithm ◽

Equal Chance ◽

Np Hard Problem ◽

General Sequences

Mining partial orders from sequence data is an important data mining task with broad applications. As partial order mining is an NP-hard problem, many efficient pruning algorithms have been proposed. In this paper, we improve a classical algorithm for discovering frequent closed partial orders (FCPO) from strings. For general sequences, we treat items that appear together as having an equal chance when calculating the detecting matrix used for pruning. Experimental evaluations on a real data set show that our algorithm can effectively mine FCPO from sequences.

Human Disease Genes and Their Cloned Mouse Orthologs: Exploration of the FANTOM2 cDNA Sequence Data Set

Genome Research ◽

10.1101/gr.979503 ◽

2003 ◽

Vol 13 (6) ◽

pp. 1496-1500 ◽

Cited By ~ 3

Author(s):

L. M. Schriml

Keyword(s):

Human Disease ◽

Cdna Sequence ◽

Sequence Data ◽

Disease Genes ◽

Data Set ◽

Mouse Orthologs ◽

Human Disease Genes

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Briefings in Bioinformatics ◽

10.1093/bib/bby017 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1542-1559 ◽

Cited By ~ 44

Author(s):

Damla Senol Cali ◽

Jeremie S Kim ◽

Saugata Ghose ◽

Can Alkan ◽

Onur Mutlu

Keyword(s):

Sequence Analysis ◽

Genome Assembly ◽

Sequence Data ◽

Error Rates ◽

Nanopore Sequencing ◽

Memory Usage ◽

Sequencing Technology ◽

Assembly Pipeline ◽

And Performance ◽

Polishing Tool

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) The choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) The read-to-read overlap finding tools GraphMap and Minimap perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability.
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

The correlations of the function and positional distribution of the cis-elements CArG around the TSS in the genes of Mus musculus

Genome ◽

10.1139/g08-117 ◽

2009 ◽

Vol 52 (3) ◽

pp. 217-221 ◽

Cited By ~ 4

Author(s):

Xia Shen ◽

Bruce Walsh ◽

Jing J. Li ◽

Hong X. Pang ◽

Wen J. Wang ◽

...

Keyword(s):

Binding Sites ◽

Copy Number ◽

Serum Response Factor ◽

Sequence Data ◽

Positional Distribution ◽

Cis Elements ◽

Transcription Start ◽

Data Set ◽

Control Set ◽

Serum Response

While many studies of the cis-element CArG, which is bound by serum response factor (SRF), are in progress, little is known about the positional distribution of functional CArG elements around the transcription start site (TSS) of the genes that they influence. We use a validated CArG data set to calculate the distance distribution of functional CArG elements around the TSS. Distances between adjacent CArGs were also analyzed. We compare these distributions with those derived using a control set of randomly selected CArGs (that were not experimentally validated for function). Our results show that most functional CArG elements (108 of 152, 71%) lie upstream of the annotated TSS, with copy number increasing as one moves closer to the TSS. Moreover, the average number of CArG elements in the CArG-containing genes is significantly higher than that in the control genes. Our study extends earlier bioinformatic analyses of functional CArG elements and provides an application of comparative sequence data to the identification of transcription factor binding sites.
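The distance distribution described above depends on measuring each element's position relative to the TSS in the orientation of the gene. The following Python sketch shows one way such a signed distance could be computed. The function name, the strand convention, and the coordinates are hypothetical and are not taken from the study; only the 108-of-152 figure comes from the abstract.

```python
from collections import Counter


def signed_distance_to_tss(element_pos: int, tss_pos: int, strand: str) -> int:
    """Negative means upstream of the TSS, positive means downstream,
    measured in the gene's own orientation."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    delta = element_pos - tss_pos
    # On the minus strand, larger genomic coordinates are upstream.
    return delta if strand == "+" else -delta


# Hypothetical (element position, TSS position, strand) triples; not study data.
elements = [(950, 1000, "+"), (1200, 1000, "+"), (5100, 5000, "-"), (4800, 5000, "-")]
distances = [signed_distance_to_tss(*e) for e in elements]
location = Counter("upstream" if d < 0 else "downstream" for d in distances)
print(distances, dict(location))

# Upstream share reported in the abstract: 108 of 152 functional elements.
upstream_pct = 100 * 108 / 152
print(f"{upstream_pct:.0f}% upstream")
```

Binning the signed distances (for example with a histogram) would then give a positional distribution of the kind the abstract compares between the validated and control sets.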

Molecular Systematics of Tribe Physarieae (Brassicaceae) Based on Nuclear ITS, LUMINIDEPENDENS, and Chloroplast ndhF

Systematic Botany ◽

10.1600/036364421x16312067913318 ◽

2021 ◽

Author(s):

Sara Fuentes-Soriano ◽

Elizabeth A. Kellogg

Keyword(s):

Sequence Data ◽

Plastid Dna ◽

Sister Relationship ◽

Data Set ◽

Woody Perennial ◽

History Of ◽

Nuclear Its ◽

Phylogenetic Framework ◽

Phylogenetic Hypotheses ◽

Relationship Of

Physarieae is a small tribe of herbaceous annual and woody perennial mustards that are mostly endemic to North America, and its members show considerable floral, fruit, and chromosomal variation. Building on a previous study of Physarieae based on morphology and ndhF plastid DNA, we reconstructed the evolutionary history of the tribe using new sequence data from two nuclear markers, and compared the new topologies against previously published cpDNA-based phylogenetic hypotheses. The novel analyses included ca. 420 new sequences of ITS and LUMINIDEPENDENS (LD) markers for 39 and 47 species, respectively, with sampling accounting for all seven genera of Physarieae, including nomenclatural type species, and 11 outgroup taxa. Maximum parsimony, maximum likelihood, and Bayesian analyses showed that these additional markers were largely consistent with the previous ndhF data that supported the monophyly of Physarieae and resolved two major clades within the tribe, i.e., DDNLS (Dithyrea, Dimorphocarpa, Nerisyrenia, Lyrocarpa, and Synthlipsis) and PP (Paysonia and Physaria). New analyses also increased internal resolution for some closely related species and lineages within both clades. The monophyly of Dithyrea and the sister relationship of Paysonia to Physaria were consistent in all trees, with the sister relationship of Nerisyrenia to Lyrocarpa supported by ndhF and ITS, and the positions of Dimorphocarpa and Synthlipsis shifted within the DDNLS clade depending on the data set employed. Finally, using the strong, new phylogenetic framework of combined cpDNA + nDNA data, we discussed standing hypotheses of trichome evolution in the tribe suggested by ndhF.
