Measuring Phylogenetic Information of Incomplete Sequence Data

2021 ◽  
Author(s):  
Tae-Kun Seo ◽  
Olivier Gascuel ◽  
Jeffrey L Thorne

Abstract Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the Effective Sequence Length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification.
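The ESL idea can be sketched numerically: if a gapped column carries only a fraction of the Fisher information of a complete column, the alignment's effective length is its total information expressed in units of one gap-free column. The per-site information values below are made-up numbers for illustration, not the paper's actual estimator.

```python
import numpy as np

def effective_sequence_length(site_information, complete_site_information):
    # ESL: total Fisher information of the alignment divided by the
    # information carried by a single gap-free column.
    return float(np.sum(site_information) / complete_site_information)

# Toy alignment: 80 complete columns plus 20 gapped columns that each
# retain only 40% of a complete column's information (made-up numbers).
info = np.array([1.0] * 80 + [0.4] * 20)
esl = effective_sequence_length(info, 1.0)  # 80 + 20 * 0.4 = 88 effective columns
```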

2015 ◽  
Vol 1105 ◽  
pp. 299-304
Author(s):  
A. Al Saleh Mohammad ◽  
A. Yussuf Abdirahman

Polyolefin molecular architectures are designed according to customer needs and demands. Hence, it is essential to determine the catalytic behavior that gives the polymer the characteristics it needs to meet market requirements. Today, most industrial polyolefin production depends on multiple-site-type catalysts such as Ziegler-Natta catalysts. In this work, a methodology for estimating parameters of polyolefin multiple-site-type catalysts is presented. Sequence length distribution data were simulated using zeroth-order and first-order Markovian models, and these simulated data were used to test the robustness of the optimization method. The optimization method was able to recover the correct probabilistic models and provide acceptable estimates of the polymerization parameters.
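Under a first-order Markovian model, the length of a run of one monomer is geometrically distributed, with the self-transition probability as the continuation parameter, which gives a quick way to simulate sequence length distribution data. The transition probability below is an illustrative value, not one of the paper's estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_run_lengths(p_self, n_runs=10_000):
    # First-order Markov chain: a run of monomer A continues with
    # probability p_self, so run lengths are geometric with
    # success probability 1 - p_self (mean 1 / (1 - p_self)).
    return rng.geometric(1.0 - p_self, size=n_runs)

runs = simulate_run_lengths(p_self=0.7)
mean_length = runs.mean()  # close to 1 / 0.3
```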


2017 ◽  
Author(s):  
Charles S. P. Foster ◽  
Simon Y. W. Ho

Abstract Evolutionary timescales can be inferred from molecular sequence data using a Bayesian phylogenetic approach. In these methods, the molecular clock is often calibrated using fossil data. The uncertainty in these fossil calibrations is important because it determines the limiting posterior distribution for divergence-time estimates as the sequence length tends to infinity. Here we investigate how the accuracy and precision of Bayesian divergence-time estimates improve with the increased clock-partitioning of genome-scale data into clock-subsets. We focus on a data set comprising plastome-scale sequences of 52 angiosperm taxa. There was little difference among the Bayesian date estimates whether we chose clock-subsets based on patterns of among-lineage rate heterogeneity or relative rates across genes, or by random assignment. Increasing the degree of clock-partitioning usually led to an improvement in the precision of divergence-time estimates, but this increase was asymptotic to a limit presumably imposed by fossil calibrations. Our clock-partitioning approaches yielded highly precise age estimates for several key nodes in the angiosperm phylogeny. For example, when partitioning the data into 20 clock-subsets based on patterns of among-lineage rate heterogeneity, we inferred crown angiosperms to have arisen 198–178 Ma. This demonstrates that judicious clock-partitioning can improve the precision of molecular dating based on phylogenomic data, but the meaning of this increased precision should be considered critically.
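One of the partitioning criteria compared above, grouping genes into clock-subsets by their relative rates, can be sketched as a simple ranking-and-binning step. The rate values and subset count below are made up for illustration.

```python
import numpy as np

def clock_subsets(gene_rates, k):
    # Rank genes by relative rate, then split the ranking into k
    # roughly equal clock-subsets (slowest genes first).
    order = np.argsort(gene_rates)
    subsets = [[] for _ in range(k)]
    for rank, gene in enumerate(order):
        subsets[rank * k // len(gene_rates)].append(int(gene))
    return subsets

rates = np.array([1.2, 0.3, 0.9, 2.1, 0.5, 1.7])  # hypothetical relative rates
groups = clock_subsets(rates, 3)  # gene indices grouped slowest to fastest
```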


2014 ◽  
Author(s):  
Samuel H. Church ◽  
Joseph F. Ryan ◽  
Casey W. Dunn

The Swofford-Olsen-Waddell-Hillis (SOWH) test evaluates statistical support for incongruent phylogenetic topologies. It is commonly applied to determine whether the maximum likelihood tree in a phylogenetic analysis is significantly different from an alternative hypothesis. The SOWH test compares the observed difference in likelihood between two topologies to a null distribution of differences in likelihood generated by parametric resampling. The test is a well-established phylogenetic method for topology testing, but it is sensitive to model misspecification, it is computationally burdensome to perform, and its implementation requires the investigator to make multiple decisions that each have the potential to affect the outcome of the test. We analyzed the effects of multiple factors using seven datasets to which the SOWH test was previously applied. These factors include bootstrap sample size, likelihood software, the introduction of gaps to simulated data, the use of distinct models of evolution for data simulation and likelihood inference, and a suggested test correction wherein an unresolved "zero-constrained" tree is used to simulate sequence data. In order to facilitate these analyses and future applications of the SOWH test, we wrote SOWHAT, a program that automates the SOWH test. We find that inadequate bootstrap sampling can change the outcome of the SOWH test. The results also show that using a zero-constrained tree for data simulation can result in a wider null distribution and higher p-values, but does not change the outcome of the SOWH test for most datasets. These results will help others implement and evaluate the SOWH test and allow us to provide recommendations for future applications of the SOWH test. SOWHAT is available for download from https://github.com/josephryan/SOWHAT.
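The core of the SOWH test, comparing an observed likelihood difference to a parametric-bootstrap null distribution, reduces to a simple p-value computation. The null deltas below are random stand-ins; real values would come from re-optimizing both trees on data simulated under the constrained topology with phylogenetic software.

```python
import numpy as np

def sowh_p_value(observed_delta, null_deltas):
    # Fraction of bootstrap likelihood differences at least as large as
    # the observed one, with a +1 correction so p is never exactly 0.
    null_deltas = np.asarray(null_deltas)
    exceed = int(np.sum(null_deltas >= observed_delta))
    return (exceed + 1) / (len(null_deltas) + 1)

# Stand-in null distribution (illustrative only):
rng = np.random.default_rng(1)
null = rng.exponential(scale=1.0, size=999)
p = sowh_p_value(6.0, null)
```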


Author(s):  
Edward Susko

Simulation studies have been the main way in which properties of maximum likelihood estimation of evolutionary trees from aligned sequence data have been studied. Because trees are unusual parameters and because fitting is computationally intensive, such studies have a heavy computational cost. We develop an asymptotic framework that can be used to obtain probabilities of correct topological reconstruction and study other properties of likelihood methods when a single split is poorly resolved. Simulations suggest that while approximations to log likelihood differences are better for less well-resolved topologies, approximations to probabilities of correct reconstruction are generally good. We used the approximations to investigate biases in estimation and found that maximum likelihood estimation has a long-branch-repels bias. This differs in kind from the long-branch-attracts bias often reported in the literature: for maximum likelihood estimation, long-branch-attracts results usually arise in the presence of model misspecification and are a form of statistical inconsistency, in which the estimated tree converges upon an incorrect tree that places the long edges together. Here, by bias we mean a tendency to favour a particular topology when data are generated from a four-taxon star tree. While we find a tendency to favour the tree with the long branches apart, with more extreme long edges a strong small-sequence-length long-branch-attracts bias overwhelms the long-branch-repels bias. The long-branch-repels bias generalizes to five and six taxa in the sense that subtrees containing taxa that are all distant from the poorly resolved split repel each other.


Author(s):  
M D MacNeil ◽  
J W Buchanan ◽  
M L Spangler ◽  
E Hay

Abstract The objective of this study was to evaluate the effects of various data structures on the genetic evaluation for the binary phenotype of reproductive success. The data were simulated based on an existing pedigree and an underlying fertility phenotype with a heritability of 0.10. A data set of complete observations was generated for all cows. This data set was then modified to mimic the culling of cows when they first failed to reproduce, cows having a missing observation at either their second or fifth opportunity to reproduce as if they had been selected as donors for embryo transfer, and the censoring of records following the sixth opportunity to reproduce as in a cull-for-age strategy. The data were analyzed using a third-order polynomial random regression model. The EBV of interest for each animal was the sum of the age-specific EBV over the first 10 observations (reproductive success at ages 2-11). Thus, the EBV can be interpreted as the genetic expectation of the number of calves produced when a female is given ten opportunities to calve. Culling open cows reduced the EBV for 3-year-old cows from 8.27 ± 0.03 when open cows were retained to 7.60 ± 0.02 when they were culled. The magnitude of this effect decreased as cows grew older when they first failed to reproduce and were subsequently culled. Cows that did not fail over the 11 years of simulated data had EBV of 9.43 ± 0.01 and 9.35 ± 0.01 based on analyses of the complete data and of the data in which cows that failed to reproduce were culled, respectively. Cows that had a missing observation for their second record had a significantly reduced EBV, but the corresponding effect at the fifth record was negligible. The current study illustrates that culling and management decisions, particularly those that affect the beginning of the trajectory of sustained reproductive success, can influence both the magnitude and accuracy of the resulting EBV.
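The summation of age-specific EBV from a third-order polynomial random regression can be sketched as below. The age standardization and the coefficient values are illustrative assumptions, not the study's fitted solutions.

```python
import numpy as np

ages = np.arange(2, 12)                      # first 10 opportunities (ages 2-11)
t = 2 * (ages - ages.min()) / (ages.max() - ages.min()) - 1  # rescale to [-1, 1]
basis = np.vander(t, 4, increasing=True)     # columns: 1, t, t^2, t^3

# Hypothetical random-regression solutions for one animal:
coeffs = np.array([0.8, -0.05, 0.02, 0.01])
age_ebv = basis @ coeffs                     # age-specific EBV, ages 2..11
total_ebv = age_ebv.sum()                    # expected calves in ten opportunities
```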


2021 ◽  
Vol 4 (1) ◽  
pp. 251524592095492
Author(s):  
Marco Del Giudice ◽  
Steven W. Gangestad

Decisions made by researchers while analyzing data (e.g., how to measure variables, how to handle outliers) are sometimes arbitrary, without an objective justification for choosing one alternative over another. Multiverse-style methods (e.g., specification curve, vibration of effects) estimate an effect across an entire set of possible specifications to expose the impact of hidden degrees of freedom and/or obtain robust, less biased estimates of the effect of interest. However, if specifications are not truly arbitrary, multiverse-style analyses can produce misleading results, potentially hiding meaningful effects within a mass of poorly justified alternatives. So far, a key question has received scant attention: How does one decide whether alternatives are arbitrary? We offer a framework and conceptual tools for doing so. We discuss three kinds of a priori nonequivalence among alternatives—measurement nonequivalence, effect nonequivalence, and power/precision nonequivalence. The criteria we review lead to three decision scenarios: Type E decisions (principled equivalence), Type N decisions (principled nonequivalence), and Type U decisions (uncertainty). In uncertain scenarios, multiverse-style analysis should be conducted in a deliberately exploratory fashion. The framework is discussed with reference to published examples and illustrated with the help of a simulated data set. Our framework will help researchers reap the benefits of multiverse-style methods while avoiding their pitfalls.
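A minimal specification-curve sketch: the same effect (here, a regression slope) is estimated under every combination of two hypothetical analytic decisions, an outlier cutoff and a predictor transform. The data and the decision options are simulated for illustration only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)   # true slope 0.5 plus noise

# Two arbitrary-looking decisions: where to cut outliers, how to transform x.
outlier_cuts = [2.5, 3.0, None]
transforms = [("identity", lambda v: v), ("tanh", np.tanh)]

effects = []
for cut, (name, tf) in itertools.product(outlier_cuts, transforms):
    keep = np.ones_like(x, dtype=bool) if cut is None else np.abs(x) < cut
    slope = np.polyfit(tf(x[keep]), y[keep], 1)[0]
    effects.append((name, cut, slope))

# The specification curve is the set of effect estimates sorted by size.
curve = sorted(s for _, _, s in effects)
```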


Genetics ◽  
2003 ◽  
Vol 165 (3) ◽  
pp. 1385-1395
Author(s):  
Claus Vogl ◽  
Aparup Das ◽  
Mark Beaumont ◽  
Sujata Mohanty ◽  
Wolfgang Stephan

Abstract Population subdivision complicates analysis of molecular variation. Even if neutrality is assumed, three evolutionary forces need to be considered: migration, mutation, and drift. Simplification can be achieved by assuming that the process of migration among and drift within subpopulations is occurring fast compared to mutation and drift in the entire population. This allows a two-step approach in the analysis: (i) analysis of population subdivision and (ii) analysis of molecular variation in the migrant pool. We model population subdivision using an infinite island model, where we allow the migration/drift parameter Θ to vary among populations. Thus, central and peripheral populations can be differentiated. For inference of Θ, we use a coalescence approach, implemented via a Markov chain Monte Carlo (MCMC) integration method that allows estimation of allele frequencies in the migrant pool. The second step of this approach (analysis of molecular variation in the migrant pool) uses the estimated allele frequencies in the migrant pool for the study of molecular variation. We apply this method to a Drosophila ananassae sequence data set. We find little indication of isolation by distance, but large differences in the migration parameter among populations. The population as a whole seems to be expanding. A population from Bogor (Java, Indonesia) shows the highest variation and seems closest to the species center.


2008 ◽  
Vol 20 (5) ◽  
pp. 1211-1238 ◽  
Author(s):  
Gaby Schneider

Oscillatory correlograms are widely used to study neuronal activity that shows a joint periodic rhythm. In most cases, the statistical analysis of cross-correlation histograms (CCH) features is based on the null model of independent processes, and the resulting conclusions about the underlying processes remain qualitative. Therefore, we propose a spike train model for synchronous oscillatory firing activity that directly links characteristics of the CCH to parameters of the underlying processes. The model focuses particularly on asymmetric central peaks, which differ in slope and width on the two sides. Asymmetric peaks can be associated with phase offsets in the (sub-) millisecond range. These spatiotemporal firing patterns can be highly consistent across units yet invisible in the underlying processes. The proposed model includes a single temporal parameter that accounts for this peak asymmetry. The model provides approaches for the analysis of oscillatory correlograms, taking into account dependencies and nonstationarities in the underlying processes. In particular, the auto- and the cross-correlogram can be investigated in a joint analysis because they depend on the same spike train parameters. Particular temporal interactions such as the degree to which different units synchronize in a common oscillatory rhythm can also be investigated. The analysis is demonstrated by application to a simulated data set.
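The displaced CCH peak associated with a millisecond-range phase offset can be illustrated with simulated spike trains. The offset, jitter, and firing rate below are arbitrary choices for illustration, not the model's fitted parameters.

```python
import numpy as np

def cross_correlogram(train_a, train_b, max_lag, bin_width):
    # Histogram of spike-time differences (b - a) within +/- max_lag.
    diffs = []
    for t in train_a:
        near = train_b[(train_b > t - max_lag) & (train_b < t + max_lag)]
        diffs.extend(near - t)
    bins = np.arange(-max_lag, max_lag + bin_width, bin_width)
    return np.histogram(diffs, bins=bins)

# Unit B follows unit A with a ~2.5 ms delay and 0.5 ms jitter,
# so the central CCH peak is displaced from zero (times in ms):
rng = np.random.default_rng(3)
a = np.sort(rng.uniform(0.0, 1000.0, 100))
b = np.sort(a + 2.5 + rng.normal(0.0, 0.5, size=a.size))
counts, edges = cross_correlogram(a, b, max_lag=10.0, bin_width=1.0)
peak_lag = edges[np.argmax(counts)] + 0.5   # centre of the tallest bin
```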


2013 ◽  
Vol 846-847 ◽  
pp. 1304-1307
Author(s):  
Ye Wang ◽  
Yan Jia ◽  
Lu Min Zhang

Mining partial orders from sequence data is an important data mining task with broad applications. Because partial order mining is an NP-hard problem, many efficient pruning algorithms have been proposed. In this paper, we improve a classical algorithm for discovering frequent closed partial orders from strings. For general sequences, we treat items appearing together as having an equal chance when calculating the detecting matrix used for pruning. Experimental evaluations on a real data set show that our algorithm can effectively mine frequent closed partial orders (FCPOs) from sequences.
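The detecting-matrix idea, counting how often one item precedes another across the input strings, can be sketched as follows. The paper's exact matrix definition may differ; this is an assumed simplification based on first occurrences.

```python
from collections import defaultdict
from itertools import combinations

def detecting_matrix(strings):
    # For each ordered pair (i, j), count the strings in which the first
    # occurrence of i comes strictly before the first occurrence of j.
    counts = defaultdict(int)
    for s in strings:
        first = {}
        for pos, item in enumerate(s):
            first.setdefault(item, pos)
        for i, j in combinations(first, 2):
            if first[i] < first[j]:
                counts[(i, j)] += 1
            else:
                counts[(j, i)] += 1
    return dict(counts)

m = detecting_matrix(["abc", "acb", "bac"])
# 'a' precedes 'c' in all three strings; 'b' precedes 'a' only in "bac".
```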

