Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses?

Author(s):  
Daniel M Portik ◽  
John J Wiens

Abstract Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data ($\sim$5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several “best practices” for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming]
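
The alignment properties named in the abstract above (length, informative sites, missing data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `alignment_stats`, the toy sequences, and the choice to treat `-`, `?`, and `N` as missing are all assumptions made for illustration. A site is counted as parsimony-informative when at least two character states each occur in at least two sequences.

```python
from collections import Counter

# Assumption: gaps and ambiguous bases are all counted as missing data.
MISSING = {"-", "?", "N"}


def alignment_stats(alignment):
    """Summarize an alignment given as {taxon_name: aligned_sequence}."""
    seqs = [s.upper() for s in alignment.values()]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    informative = 0
    missing = 0
    for column in zip(*seqs):
        # Count only observed character states in this column.
        counts = Counter(c for c in column if c not in MISSING)
        missing += len(column) - sum(counts.values())
        # Parsimony-informative: two or more states, each seen at least twice.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1

    return {
        "length": length,
        "informative_sites": informative,
        "missing_fraction": missing / (length * len(seqs)),
    }


# Hypothetical four-taxon toy alignment.
toy = {
    "taxonA": "ACGT-A",
    "taxonB": "ACGTNA",
    "taxonC": "ATGA-C",
    "taxonD": "ATGACC",
}
stats = alignment_stats(toy)
print(stats)
```

Running the same function on the output of each alignment and trimming combination would give directly comparable summaries, which is the kind of comparison the abstract describes.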

2022 ◽  
Author(s):  
XiaoXu Pang ◽  
Da-Yong Zhang

The species studied in any evolutionary investigation generally constitute a very small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on estimations of species trees has been rarely studied and is thus poorly understood. In this study, we use mathematical analysis and simulations to examine the robustness of species tree methods based on a multispecies coalescent model under gene flow sourcing from an extant or ghost lineage. We found that very low levels of extant or ghost introgression can result in anomalous gene trees (AGTs) on three-taxon rooted trees if accompanied by strong incomplete lineage sorting (ILS). In contrast, even massive introgression, with more than half of the recipient genome descending from the donor lineage, may not necessarily lead to AGTs. In cases involving an ingroup lineage (defined as one that diverged no earlier than the most basal species under investigation) acting as the donor of introgression, the time of root divergence among the investigated species was either underestimated or remained unaffected, but for the cases of outgroup ghost lineages acting as donors, the divergence time was generally overestimated. Under many conditions of ingroup introgression, the stronger the ILS was, the higher was the accuracy of estimating the time of root divergence, although the topology of the species tree is more prone to be biased by the effect of introgression.


Author(s):  
Fereshteh Shahoveisi ◽  
Atena Oladzad ◽  
Luis E. del Rio Mendoza ◽  
Seyedali Hosseinirad ◽  
Susan Ruud ◽  
...  

The polyploid nature of canola (Brassica napus) represents a challenge for the accurate identification of single nucleotide polymorphisms (SNPs) and the detection of quantitative trait loci (QTL). In this study, combinations of eight phenotyping scoring systems and six SNP calling and filtering parameters were evaluated for their efficiency in detection of QTL associated with response to Sclerotinia stem rot, caused by Sclerotinia sclerotiorum, in two doubled haploid (DH) canola mapping populations. Most QTL were detected in lesion length, relative areas under the disease progress curve (rAUDPC) for lesion length, and binomial-plant mortality data sets. Binomial data derived from lesion size were less efficient in QTL detection. Inclusion of additional phenotypic sets in the analysis increased the numbers of significant QTL by 2.3-fold; however, the continuous data sets were more efficient. Between two filtering parameters used to analyze genotyping by sequencing (GBS) data, imputation of missing data increased QTL detection in one population with a high level of missing data but not in the other. Inclusion of segregation-distorted SNPs increased QTL detection but did not impact their R² values significantly. Twelve of the 16 detected QTL were on chromosomes A02 and C01, and the rest were on A07, A09, and C03. Marker A02-7594120, associated with a QTL on chromosome A02, was detected in both populations. Results of this study suggest that the impact of genotypic variant calling and filtering parameters may be population dependent, while deriving additional phenotyping scoring systems such as rAUDPC and binary mortality data sets may improve QTL detection efficiency.


2020 ◽  
Author(s):  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder ◽  
Alex Dornburg

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.


2019 ◽  
Author(s):  
Alex Dornburg ◽  
Dustin J. Wcisel ◽  
J. Thomas Howard ◽  
Jeffrey A. Yoder

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.


Author(s):  
D. O. Nyaboga ◽  
A. Mwangi ◽  
D. Lusweti

Missing data is a common problem in real-world studies, especially clinical studies. However, most people working with such data drop individuals with incomplete observations, which arise when patients do not complete the treatment or miss their scheduled visits. This may lead to misleading results and ultimately affect the decision of whether an intervention is good or bad for the patients under treatment. Complete Case (CC) and Inverse Probability Weights (IPW) techniques for handling missing data have been compared in various models; however, little has been done to compare these methods when applied to joint models of longitudinal and time-to-event data. Therefore, this paper investigates the impact of CC analysis on clinical data with missing cases, comparing it with the IPW method when fitting joint models of longitudinal and survival data, with the full-data model as the baseline. This paper used data from a randomized AIDS clinical trial. The model with a Deviance Information Criterion (DIC) closest to that of the full-data joint model is considered the best. The joint models from the full data, CC, and IPW had DICs of 10603.94, 8410.33, and 10600.95, respectively. The joint model obtained from the IPW data thus had a DIC much closer to that of the full-data joint model than did the model from the CC data.
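
The selection rule stated in the abstract above (prefer the method whose DIC lies closest to the full-data baseline) can be checked with simple arithmetic on the reported values. The sketch below only restates that rule; the dictionary keys and variable names are illustrative assumptions, and no model is fitted here.

```python
# DIC values as reported in the abstract above.
dic = {"full": 10603.94, "CC": 8410.33, "IPW": 10600.95}

# Absolute gap between each missing-data method and the full-data baseline.
gaps = {
    name: round(abs(dic["full"] - value), 2)
    for name, value in dic.items()
    if name != "full"
}

# The abstract's criterion: the smallest gap identifies the preferred method.
closest = min(gaps, key=gaps.get)
print(gaps, closest)
```

The gaps come to about 2.99 for IPW and about 2193.61 for CC, which is why the IPW model is the one judged closest to the full-data model.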


2021 ◽  
Author(s):  
Trenton J. Davis ◽  
Tarek R. Firzli ◽  
Emily A. Higgins Keppler ◽  
Matt Richardson ◽  
Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected when conducting data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy) and introduce an R package MetabImpute to carry out these analyses. Our results conclude that, in these two data sets, missingness was most likely of the missing at-random (MAR) and missing not-at-random (MNAR) types as opposed to missing completely at-random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
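
The single-value imputation baselines listed in the abstract above (zero, minimum, mean, median, and half-minimum) are simple enough to sketch directly. The following is a minimal illustration and not the MetabImpute API: the function `single_value_impute`, the use of `None` to mark a missing value, and the toy peak areas are all assumptions.

```python
from statistics import mean, median


def single_value_impute(values, method="half-minimum"):
    """Replace None entries of one feature with a single fill value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")

    # Each baseline fills every missing entry with the same number.
    fill_values = {
        "zero": 0.0,
        "minimum": min(observed),
        "mean": mean(observed),
        "median": median(observed),
        "half-minimum": min(observed) / 2,
    }
    if method not in fill_values:
        raise ValueError(f"unknown method: {method}")

    fill = fill_values[method]
    return [fill if v is None else v for v in values]


# Hypothetical peak areas for one metabolite across four samples.
peak_areas = [4.0, None, 2.0, 6.0]
imputed = single_value_impute(peak_areas, "half-minimum")
print(imputed)
```

Because every missing entry receives the same value, these baselines ignore any structure among samples or features, which is consistent with the abstract's report that model-based approaches performed better for MAR and MNAR data.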


2017 ◽  
Author(s):  
Erin K. Molloy ◽  
Tandy Warnow

Abstract Species tree estimation from loci sampled from multiple genomes is now common, but is challenged by the heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which is frequently encountered in genome-scale datasets, will impact species tree estimation. Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML). Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting.
Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.


2017 ◽  
Vol 13 (9) ◽  
pp. 20170393 ◽  
Author(s):  
Jeffrey W. Streicher ◽  
John J. Wiens

Squamate reptiles (lizards and snakes) are the most diverse group of terrestrial vertebrates, with more than 10 000 species. Despite considerable effort to resolve relationships among major squamate clades, some branches have remained difficult. Among the most vexing has been the placement of snakes among lizard families, with most studies yielding only weak support for the position of snakes. Furthermore, the placement of iguanian lizards has remained controversial. Here we used targeted sequence capture to obtain data from 4178 nuclear loci from ultraconserved elements from 32 squamate taxa (and five outgroups) including representatives of all major squamate groups. Using both concatenated and species-tree methods, we recover strong support for a sister relationship between iguanian and anguimorph lizards, with snakes strongly supported as the sister group of these two clades. These analyses strongly resolve the difficult placement of snakes within squamates and show overwhelming support for the contentious position of iguanians. More generally, we provide a strongly supported hypothesis of higher-level relationships in the most species-rich tetrapod clade using coalescent-based species-tree methods and approximately 100 times more loci than previous estimates.


2015 ◽  
Vol 33 (3) ◽  
pp. 838-860 ◽  
Author(s):  
Zhenxiang Xi ◽  
Liang Liu ◽  
Charles C. Davis
