The Interaction between Base Compositional Heterogeneity and Among-Site Rate Variation in Models of Molecular Evolution

2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Nathan C. Sheffield

Many commonly used models of molecular evolution assume homogeneous nucleotide frequencies. A deviation from this assumption has been shown to cause problems for phylogenetic inference. However, some claim that only extreme heterogeneity affects phylogenetic accuracy and suggest that violations of other model assumptions, such as variable rates among sites, are more problematic. To explore the interaction between compositional heterogeneity and variable rates among sites, I reanalyzed three real heterogeneous datasets using several models. Bayesian inference recovered accurate topologies under variable rates-among-sites models, but failed under some models that account for compositional heterogeneity. I also ran simulations and found that accounting for rates among sites improves topology accuracy in compositionally heterogeneous data. This indicates that in some cases, models accounting for among-site rate variation can improve outcomes for data that violate the assumption of compositional homogeneity.

2010 ◽  
Vol 35 (3) ◽  
pp. 429-448 ◽  
Author(s):  
HOJUN SONG ◽  
NATHAN C. SHEFFIELD ◽  
STEPHEN L. CAMERON ◽  
KELLY B. MILLER ◽  
MICHAEL F. WHITING

1991 ◽  
Vol 66 (4) ◽  
pp. 411-453 ◽  
Author(s):  
David M. Hillis ◽  
Michael T. Dixon

1992 ◽  
Vol 6 ◽  
pp. 68-68
Author(s):  
Timothy Collins

The marine vicariant event resulting from the Pliocene emergence of the Central American Isthmus presents a unique opportunity for calibrating rates of molecular evolution. The synchronous fragmentation of the ranges of previously widespread taxa into Western Atlantic and Eastern Pacific components (geminates) enables one to make comparisons of rates among higher taxa on the same time scale and to evaluate the regularity of rates of molecular evolution among all species sampled. Other advantages of this approach are that the time scale (approximately 3 Ma) is one of particular interest for evolutionary biologists concerned with speciation and one that minimizes the ambiguities associated with augmentation of divergence values to account for multiple hits at a site. The divergence values derived for geminate pairs are independent, allowing statistical evaluation of variance in rates. The current popularity of the relative rates test as the final arbiter of questions regarding rates and rate variation is primarily a matter of convenience and not a reflection of methodological superiority. A review of the commonly used techniques for calibrating rates of molecular evolution shows that each approach has limitations. Temporally based calibrations of rates are necessary complements to time-independent comparisons. Interpretation of transisthmian molecular comparisons in the literature has in many cases been unduly influenced and confused by molecular clock assumptions and the restriction of studies to single higher-level taxa. Analysis of the apparently contradictory published data as well as new results from sequence comparisons of fishes, urchins and snails suggests a synthesis: taxon-specific rates of molecular evolution, with reduced variance within taxonomic groups and great variance among all groups sampled.


Biostatistics ◽  
2010 ◽  
Vol 11 (2) ◽  
pp. 317-336 ◽  
Author(s):  
Sylvia Frühwirth-Schnatter ◽  
Saumyadipta Pyne

Skew-normal and skew-t distributions have proved to be useful for capturing skewness and kurtosis in data directly without transformation. Recently, finite mixtures of such distributions have been considered as a more general tool for handling heterogeneous data involving asymmetric behaviors across subpopulations. We consider such mixture models for both univariate and multivariate data. This allows robust modeling of high-dimensional multimodal and asymmetric data generated by popular biotechnological platforms such as flow cytometry. We develop Bayesian inference based on data augmentation and Markov chain Monte Carlo (MCMC) sampling. In addition to the latent allocations, data augmentation is based on a stochastic representation of the skew-normal distribution in terms of a random-effects model with truncated normal random effects. For finite mixtures of skew normals, this leads to a Gibbs sampling scheme that draws from standard densities only. This MCMC scheme is extended to mixtures of skew-t distributions based on representing the skew-t distribution as a scale mixture of skew normals. As an important application of our new method, we demonstrate how it provides a new computational framework for automated analysis of high-dimensional flow cytometric data. Using multivariate skew-normal and skew-t mixture models, we could model non-Gaussian cell populations rigorously and directly without transformation or projection to lower dimensions.
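
The stochastic representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard univariate form, in which a skew-normal draw is a location plus a scaled sum of a half-normal (truncated normal) random effect and an independent normal error. The function name `sample_skew_normal` and the parameter names `xi`, `omega`, and `alpha` are illustrative choices, not taken from the paper, and the sketch covers only forward simulation, not the Gibbs sampler itself.

```python
import math
import random


def sample_skew_normal(xi, omega, alpha, rng):
    """Draw one value from a univariate skew-normal distribution.

    Assumed representation (hypothetical parameter names):
        Y = xi + omega * (delta * Z + sqrt(1 - delta^2) * eps)
    with Z a standard normal truncated to [0, inf), eps a standard
    normal, and delta = alpha / sqrt(1 + alpha^2).
    """
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    z = abs(rng.gauss(0.0, 1.0))   # half-normal: the truncated normal random effect
    eps = rng.gauss(0.0, 1.0)      # independent symmetric error term
    return xi + omega * (delta * z + math.sqrt(1.0 - delta * delta) * eps)


# Compare the sample mean with the known mean xi + omega * delta * sqrt(2 / pi).
rng = random.Random(0)
alpha = 5.0
draws = [sample_skew_normal(0.0, 1.0, alpha, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
theoretical_mean = (alpha / math.sqrt(1.0 + alpha * alpha)) * math.sqrt(2.0 / math.pi)
```

Conditional on the latent `z`, the draw is an ordinary normal variable, which is the property that lets a Gibbs sampler work with standard densities only.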


2011 ◽  
Vol 366 (1577) ◽  
pp. 2503-2513 ◽  
Author(s):  
Lindell Bromham

DNA sequences evolve at different rates in different species. This rate variation has been most closely examined in mammals, revealing a large number of characteristics that can shape the rate of molecular evolution. Many of these traits are part of the mammalian life-history continuum: species with small body size, rapid generation turnover, high fecundity and short lifespans tend to have faster rates of molecular evolution. In addition, rate of molecular evolution in mammals might be influenced by behaviour (such as mating system), ecological factors (such as range restriction) and evolutionary history (such as diversification rate). I discuss the evidence for these patterns of rate variation, and the possible explanations of these correlations. I also consider the impact of these systematic patterns of rate variation on the reliability of the molecular date estimates that have been used to suggest a Cretaceous radiation of modern mammals, before the final extinction of the dinosaurs.


2021 ◽  
Vol 17 (8) ◽  
pp. e1009283
Author(s):  
Tomasz Konopka ◽  
Sandra Ng ◽  
Damian Smedley

Integrating reference datasets (e.g. from high-throughput experiments) with unstructured and manually-assembled information (e.g. notes or comments from individual researchers) has the potential to tailor bioinformatic analyses to specific needs and to lead to new insights. However, developing bespoke analysis pipelines from scratch is time-consuming, and general tools for exploring such heterogeneous data are not available. We argue that by treating all data as text, a knowledge-base can accommodate a range of bioinformatic data types and applications. We show that a database coupled to nearest-neighbor algorithms can address common tasks such as gene-set analysis as well as specific tasks such as ontology translation. We further show that a mathematical transformation motivated by diffusion can be effective for exploration across heterogeneous datasets. Diffusion enables the knowledge-base to begin with a sparse query, impute more features, and find matches that would otherwise remain hidden. This can be used, for example, to map multi-modal queries consisting of gene symbols and phenotypes to descriptions of diseases. Diffusion also enables user-driven learning: when the knowledge-base cannot provide satisfactory search results in the first instance, users can improve the results in real-time by adding domain-specific knowledge. User-driven learning has implications for data management, integration, and curation.


2020 ◽  
Vol 59 (1) ◽  
pp. 5-30
Author(s):  
Silvia Adrián‐Serrano ◽  
Jesus Lozano‐Fernandez ◽  
Joan Pons ◽  
Julio Rozas ◽  
Miquel A. Arnedo

2019 ◽  
Vol 1 (12) ◽  
Author(s):  
Najat Ali ◽  
Daniel Neagu ◽  
Paul Trundle

Distance-based algorithms are widely used for data classification problems. The k-nearest neighbour classification (k-NN) is one of the most popular distance-based algorithms. This classification is based on measuring the distances between the test sample and the training samples to determine the final classification output. The traditional k-NN classifier works naturally with numerical data. The main objective of this paper is to investigate the performance of k-NN on heterogeneous datasets, where data can be described as a mixture of numerical and categorical features. For the sake of simplicity, this work considers only one type of categorical data, which is binary data. In this paper, several similarity measures are defined by combining well-known distances for numerical and binary data, and are used to investigate the performance of k-NN in classifying such heterogeneous datasets. The experiments used six heterogeneous datasets from different domains and two categories of measures. Experimental results showed that the proposed measures performed better for heterogeneous data than Euclidean distance, and that the challenges raised by the nature of heterogeneous data need personalised similarity measures adapted to the data characteristics.
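
The idea of combining a numerical distance with a binary distance inside k-NN can be sketched as follows. This is an illustration only: the weighted sum of Euclidean and normalised Hamming distance, the `weight` parameter, and the toy data are assumptions made for the example, not the measures or datasets evaluated in the paper.

```python
import math
from collections import Counter


def mixed_distance(a, b, num_idx, bin_idx, weight=0.5):
    """Weighted sum of Euclidean distance over numeric features and
    normalised Hamming distance over binary features (assumed combination)."""
    euclid = math.sqrt(sum((a[i] - b[i]) ** 2 for i in num_idx))
    hamming = sum(a[i] != b[i] for i in bin_idx) / max(len(bin_idx), 1)
    return weight * euclid + (1.0 - weight) * hamming


def knn_predict(train_X, train_y, query, num_idx, bin_idx, k=3, weight=0.5):
    """Classify `query` by majority vote among its k nearest training samples."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: mixed_distance(pair[0], query, num_idx, bin_idx, weight),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: two numeric features (scaled to [0, 1]) followed by two binary features.
train_X = [
    (0.10, 0.20, 0, 0),
    (0.20, 0.10, 0, 1),
    (0.15, 0.25, 0, 0),
    (0.90, 0.80, 1, 1),
    (0.80, 0.90, 1, 0),
    (0.85, 0.95, 1, 1),
]
train_y = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(
    train_X, train_y, (0.12, 0.18, 0, 0), num_idx=(0, 1), bin_idx=(2, 3)
)
```

Numeric features are scaled to a common range here so that neither component dominates the sum. How the two parts are balanced is exactly the kind of data-dependent choice the abstract argues for.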

