Exploring the phylogeny of rosids with a five-locus supermatrix from GenBank

2019 ◽  
Author(s):  
Miao Sun ◽  
Ryan A. Folk ◽  
Matthew A. Gitzendanner ◽  
Stephen A. Smith ◽  
Charlotte Germain-Aubrey ◽  
...  

Abstract Current advances in sequencing technology have greatly increased the availability of sequence data from public genetic databases. With data from GenBank, we assemble and phylogenetically investigate a 19,740-taxon, five-locus supermatrix (i.e., atpB, rbcL, matK, matR, and ITS) for rosids, a large clade containing over 90,000 species, or approximately a quarter of all angiosperms (assuming an estimate of 400,000 angiosperm species). The topology and divergence times of the five-locus tree generally agree with previous estimates of rosid phylogeny, and we recover greater resolution and support in several areas along the rosid backbone, but with a few significant differences (e.g., the placement of the COM clade, as well as Myrtales, Vitales, and Zygophyllales). Our five-locus phylogeny is the most comprehensive DNA data set yet compiled for the rosid clade. Yet, even with 19,740 species, current sampling represents only 16-22% of all rosids, and we also find evidence of strong phylogenetic bias in the accumulation of GenBank data, highlighting continued challenges for species coverage. These limitations also exist in other major angiosperm clades (e.g., asterids, monocots) as well as other large, understudied branches of the Tree of Life, highlighting the need for broader molecular sampling. Nevertheless, the phylogeny presented here improves upon sampling by more than two-fold and will be an important resource for macroevolutionary studies of this pivotal clade.
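As a rough check on the coverage figures in this abstract, the short Python sketch below back-calculates the total number of rosid species implied by 19,740 sampled taxa at 16-22% coverage. The function name `implied_total` is hypothetical and introduced only for illustration; the calculation is simple division and is not taken from the paper's methods.

```python
def implied_total(sampled: int, coverage: float) -> float:
    """Total species count implied by a sample size and a coverage fraction."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return sampled / coverage


SAMPLED = 19_740  # taxa in the supermatrix, as stated in the abstract

# Higher coverage implies a smaller total, so 22% gives the lower bound.
low = implied_total(SAMPLED, 0.22)
high = implied_total(SAMPLED, 0.16)

print(f"Implied rosid species: {low:,.0f} to {high:,.0f}")
```

The implied range, roughly 90,000 to 123,000 species, is broadly in line with the abstract's statement that rosids contain over 90,000 species, allowing for rounding in the reported percentages.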

mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Gabriel A. Al-Ghalith ◽  
Benjamin Hillmann ◽  
Kaiwei Ang ◽  
Robin Shields-Cutler ◽  
Dan Knights

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or type of adaptors used to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). 
Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.


2019 ◽  
Vol 19 (1) ◽  
Author(s):  
Yan Du ◽  
Shaoyuan Wu ◽  
Scott V. Edwards ◽  
Liang Liu

Abstract Background The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. It is widely appreciated that automated alignment protocols can yield inaccuracies, but the relative impact of various sources of error on phylogenomic analysis is not yet known. This study employs an updated mammal data set of 5162 coding loci sampled from 90 species to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation. Additionally, a novel coalescent likelihood ratio test is introduced for comparing competing species trees against a given set of gene trees. Results The aligned DNA sequences of 5162 loci from 90 species were trimmed and filtered using trimAL and two filtering protocols. The final dataset contains 4 sets of alignments - before trimming, after trimming, filtered by a recently proposed pipeline, and further filtered by comparing ML gene trees for each locus with the concatenation tree. Our analyses suggest that the average discordance among the coalescent trees is significantly smaller than that among the concatenation trees estimated from the 4 sets of alignments or with different substitution models. There is no significant difference among the divergence times estimated with different substitution models. However, the divergence dates estimated from the alignments after trimming are more recent than those estimated from the alignments before trimming. Conclusions Our results highlight that alignment uncertainty of the updated mammal data set and the choice of substitution models have little impact on tree topologies yielded by coalescent methods for species tree estimation, whereas they are more influential on the trees made by concatenation. 
Given the choice of calibration scheme and clock models, divergence time estimates are robust to the choice of substitution models, but removing alignments deemed problematic by trimming algorithms can lead to more recent dates. Although the fossil prior is important in divergence time estimation, Bayesian estimates of divergence times in this data set are driven primarily by the sequence data.


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Stefanie Hartmann ◽  
Michaela Preick ◽  
Silke Abelt ◽  
André Scheffel ◽  
Michael Hofreiter

Abstract Objective Plant carnivory is distributed across the tree of life and has evolved at least six times independently, but sequenced and annotated nuclear genomes of carnivorous plants are currently lacking. We have sequenced and structurally annotated the nuclear genome of the carnivorous Roridula gorgonias and that of a non-carnivorous relative, Madeira’s lily-of-the-valley-tree, Clethra arborea, both within the Ericales. These data add an important resource to study the evolutionary genetics of plant carnivory across angiosperm lineages and also for functional and systematic aspects of plants within the Ericales. Results Our assemblies have total lengths of 284 Mbp (R. gorgonias) and 511 Mbp (C. arborea) and show high BUSCO scores of 84.2% and 89.5%, respectively. We used their predicted genes together with publicly available data from other Ericales’ genomes and transcriptomes to assemble a phylogenomic data set for the inference of a species tree. However, groups of orthologs showed a marked absence of species represented by a transcriptome. We discuss possible reasons and caution against combining predicted genes from genome- and transcriptome-based assemblies.


Genetics ◽  
2003 ◽  
Vol 165 (3) ◽  
pp. 1385-1395
Author(s):  
Claus Vogl ◽  
Aparup Das ◽  
Mark Beaumont ◽  
Sujata Mohanty ◽  
Wolfgang Stephan

Abstract Population subdivision complicates analysis of molecular variation. Even if neutrality is assumed, three evolutionary forces need to be considered: migration, mutation, and drift. Simplification can be achieved by assuming that the process of migration among and drift within subpopulations is occurring fast compared to mutation and drift in the entire population. This allows a two-step approach in the analysis: (i) analysis of population subdivision and (ii) analysis of molecular variation in the migrant pool. We model population subdivision using an infinite island model, where we allow the migration/drift parameter Θ to vary among populations. Thus, central and peripheral populations can be differentiated. For inference of Θ, we use a coalescence approach, implemented via a Markov chain Monte Carlo (MCMC) integration method that allows estimation of allele frequencies in the migrant pool. The second step of this approach (analysis of molecular variation in the migrant pool) uses the estimated allele frequencies in the migrant pool for the study of molecular variation. We apply this method to a Drosophila ananassae sequence data set. We find little indication of isolation by distance, but large differences in the migration parameter among populations. The population as a whole seems to be expanding. A population from Bogor (Java, Indonesia) shows the highest variation and seems closest to the species center.
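To give a feel for how a population-specific migration/drift parameter can separate central from peripheral populations, the Python sketch below simulates allele frequencies under a common textbook approximation of the island model. It assumes a biallelic locus whose subpopulation frequency follows a Beta(Θp, Θ(1−p)) distribution around the migrant-pool frequency p. This distribution, the function name `sample_subpop_freqs`, and all parameter values are illustrative assumptions, and the abstract does not state that the authors use this exact form.

```python
import random
import statistics


def sample_subpop_freqs(p_migrant, theta, n_subpops, rng):
    """Draw subpopulation allele frequencies around the migrant-pool frequency.

    Assumption: biallelic locus with a Beta(theta*p, theta*(1-p)) equilibrium
    distribution, a textbook island-model approximation that is not
    necessarily the likelihood used in the paper.
    """
    a = theta * p_migrant
    b = theta * (1.0 - p_migrant)
    return [rng.betavariate(a, b) for _ in range(n_subpops)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Small theta: drift dominates migration (illustratively, "peripheral").
peripheral = sample_subpop_freqs(0.3, theta=1.0, n_subpops=2000, rng=rng)
# Large theta: migration dominates drift (illustratively, "central").
central = sample_subpop_freqs(0.3, theta=50.0, n_subpops=2000, rng=rng)

print("variance, small theta:", statistics.pvariance(peripheral))
print("variance, large theta:", statistics.pvariance(central))
```

Under this assumed model the variance of subpopulation frequencies is p(1−p)/(Θ+1), so populations with small Θ scatter widely around the migrant pool while those with large Θ stay close to it.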


2015 ◽  
Vol 370 (1684) ◽  
pp. 20150046 ◽  
Author(s):  
Gregory A. Wray

The timing of early animal evolution remains poorly resolved, yet it is critical for understanding nervous system evolution. Methods for estimating divergence times from sequence data have improved considerably, providing a more refined understanding of key divergences. The best molecular estimates point to the origin of metazoans and bilaterians tens to hundreds of millions of years earlier than their first appearances in the fossil record. Both the molecular and fossil records are compatible, however, with the possibility of tiny, unskeletonized, low energy budget animals during the Proterozoic that had planktonic, benthic, or meiofaunal lifestyles. Such animals would likely have had relatively simple nervous systems equipped primarily to detect food, avoid inhospitable environments and locate mates. The appearance of the first macropredators during the Cambrian would have changed the selective landscape dramatically, likely driving the evolution of complex sense organs, sophisticated sensory processing systems, and diverse effector systems involved in capturing prey and avoiding predation.


2013 ◽  
Vol 846-847 ◽  
pp. 1304-1307
Author(s):  
Ye Wang ◽  
Yan Jia ◽  
Lu Min Zhang

Mining partial orders from sequence data is an important data mining task with broad applications. As partial order mining is an NP-hard problem, many efficient pruning algorithms have been proposed. In this paper, we improve a classical algorithm for discovering frequent closed partial orders (FCPO) from strings. For general sequences, we treat items that appear together as having an equal chance when calculating the detecting matrix used for pruning. Experimental evaluations on a real data set show that our algorithm can effectively mine FCPO from sequences.


2018 ◽  
Vol 20 (4) ◽  
pp. 1542-1559 ◽  
Author(s):  
Damla Senol Cali ◽  
Jeremie S Kim ◽  
Saugata Ghose ◽  
Can Alkan ◽  
Onur Mutlu

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.


Genome ◽  
2009 ◽  
Vol 52 (3) ◽  
pp. 217-221 ◽  
Author(s):  
Xia Shen ◽  
Bruce Walsh ◽  
Jing J. Li ◽  
Hong X. Pang ◽  
Wen J. Wang ◽  
...  

While many studies of cis-elements CArG bound by serum response factor (SRF) are in progress, little is known about the positional distribution of the functional CArG elements around the transcription start site (TSS) of genes that they influence. We use a validated CArG data set to calculate the distance distribution of functional CArG elements around the TSS. Distances between adjacent CArGs were also analyzed. We compare these distributions with those derived using a control set of randomly selected CArGs (that were not experimentally validated for function). Our results show that most functional CArG elements (108 of 152, 71%) exist upstream of the annotated TSS, with copy number increasing as one moves closer to the TSS. Moreover, the average number of the CArG elements in the CArG-containing genes is significantly more than that in the control genes. Our study extends earlier bioinformatic analyses of functional CArG elements and provides an application of comparative sequence data to the identification of transcription factor binding sites.


Author(s):  
Sara Fuentes-Soriano ◽  
Elizabeth A. Kellogg

Physarieae is a small tribe of herbaceous annual and woody perennial mustards that are mostly endemic to North America, whose members show a large amount of floral, fruit, and chromosomal variation. Building on a previous study of Physarieae based on morphology and ndhF plastid DNA, we reconstructed the evolutionary history of the tribe using new sequence data from two nuclear markers, and compared the new topologies against previously published cpDNA-based phylogenetic hypotheses. The novel analyses included ca. 420 new sequences of ITS and LUMINIDEPENDENS (LD) markers for 39 and 47 species, respectively, with sampling accounting for all seven genera of Physarieae, including nomenclatural type species, and 11 outgroup taxa. Maximum parsimony, maximum likelihood, and Bayesian analyses showed that these additional markers were largely consistent with the previous ndhF data that supported the monophyly of Physarieae and resolved two major clades within the tribe, i.e., DDNLS (Dithyrea, Dimorphocarpa, Nerisyrenia, Lyrocarpa, and Synthlipsis) and PP (Paysonia and Physaria). New analyses also increased internal resolution for some closely related species and lineages within both clades. The monophyly of Dithyrea and the sister relationship of Paysonia to Physaria were consistent in all trees, with the sister relationship of Nerisyrenia to Lyrocarpa supported by ndhF and ITS, and the positions of Dimorphocarpa and Synthlipsis shifted within the DDNLS clade depending on the data set employed. Finally, using the strong, new phylogenetic framework of combined cpDNA + nDNA data, we discussed standing hypotheses of trichome evolution in the tribe suggested by ndhF.

