Avoiding ascertainment bias in the maximum likelihood inference of phylogenies based on truncated data

2017 ◽  
Author(s):  
Asif Tamuri ◽  
Nick Goldman

Abstract Some phylogenetic datasets omit data matrix positions at which all taxa share the same state. For sequence data this may be because of a focus on single nucleotide polymorphisms (SNPs) or the use of a technique such as restriction site-associated DNA sequencing (RADseq) that concentrates attention onto regions of differences. With morphological data, it is common to omit states that show no variation across the data studied. It is already known that failing to correct for the ascertainment bias of omitting constant positions can lead to overestimates of evolutionary divergence, as the lack of constant sites is explained as high divergence rather than as a deliberate data selection technique. Previous approaches to using corrections to the likelihood function in order to avoid ascertainment bias have either required knowledge of the omitted positions, or have modified the likelihood function to reflect the omitted data. In this paper we indicate that the technique used to date for this latter approach is a conditional maximum likelihood (CML) method. An alternative approach, unconditional maximum likelihood (UML), is also possible. We investigate the performance of CML and UML and find them to have almost identical performance in the phylogenetic SNP dataset context. We also make some observations about the nucleotide frequencies observed in SNP datasets, indicating that these can differ systematically from the overall equilibrium base frequencies of the substitution process. This suggests that model parameters representing base frequencies should be estimated by maximum likelihood, and not by empirical (counting) methods.
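The conditional correction described in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes a Jukes-Cantor (JC69) substitution model, three taxa on a star tree with one shared branch length `t`, and expected pattern counts in place of real data. All function names are hypothetical. It compares an uncorrected likelihood, which treats the variable-only sites as if they were the full alignment, with a conditional likelihood in which each pattern probability is divided by the probability that a site is variable.

```python
import math

def pattern_probs(t):
    """Return (P_constant, P_two_same, P_all_different) for three taxa on a
    star tree with equal branch lengths t under JC69 (illustrative assumption)."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # a leaf carries the root state
    p_diff = 0.25 - 0.25 * e   # a leaf carries one specific other state
    p_const = p_same**3 + 3 * p_diff**3
    p_all_diff = 18 * p_same * p_diff**2 + 6 * p_diff**3
    p_two_same = 1.0 - p_const - p_all_diff
    return p_const, p_two_same, p_all_diff

def log_lik(t, n_two_same, n_all_diff, conditional):
    """Log-likelihood of the variable-site pattern counts at branch length t."""
    p0, p2, p3 = pattern_probs(t)
    # The conditional version divides by P(site is variable) = 1 - p0.
    norm = (1.0 - p0) if conditional else 1.0
    return n_two_same * math.log(p2 / norm) + n_all_diff * math.log(p3 / norm)

def grid_mle(n_two_same, n_all_diff, conditional):
    """Maximise the log-likelihood over a simple grid of branch lengths."""
    grid = [k / 1000.0 for k in range(1, 5001)]
    return max(grid, key=lambda t: log_lik(t, n_two_same, n_all_diff, conditional))

# Expected counts of variable patterns at a hypothetical true branch length;
# constant sites are dropped, as in a SNP-only dataset.
true_t = 0.1
_, p2, p3 = pattern_probs(true_t)
n_two_same, n_all_diff = 10000 * p2, 10000 * p3

t_naive = grid_mle(n_two_same, n_all_diff, conditional=False)
t_cml = grid_mle(n_two_same, n_all_diff, conditional=True)
print(f"uncorrected: {t_naive:.3f}  conditional: {t_cml:.3f}  true: {true_t}")
```

In this toy setting the uncorrected estimate is inflated because the missing constant sites can only be explained by a longer branch, while the conditional estimate stays close to the value used to generate the counts.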

2011 ◽  
Vol 23 (11) ◽  
pp. 2833-2867 ◽  
Author(s):  
Yi Dong ◽  
Stefan Mihalas ◽  
Alexander Russell ◽  
Ralph Etienne-Cummings ◽  
Ernst Niebur

When a neuronal spike train is observed, what can we deduce from it about the properties of the neuron that generated it? A natural way to answer this question is to make an assumption about the type of neuron, select an appropriate model for this type, and then choose the model parameters as those that are most likely to generate the observed spike train. This is the maximum likelihood method. If the neuron obeys simple integrate-and-fire dynamics, Paninski, Pillow, and Simoncelli (2004) showed that its negative log-likelihood function is convex and that, at least in principle, its unique global minimum can thus be found by gradient descent techniques. Many biological neurons are, however, known to generate a richer repertoire of spiking behaviors than can be explained in a simple integrate-and-fire model. For instance, such a model retains only an implicit (through spike-induced currents), not an explicit, memory of its input; an example of a physiological situation that cannot be explained is the absence of firing if the input current is increased very slowly. Therefore, we use an expanded model (Mihalas & Niebur, 2009), which is capable of generating a large number of complex firing patterns while still being linear. Linearity is important because it maintains the distribution of the random variables and still allows maximum likelihood methods to be used. In this study, we show that although convexity of the negative log-likelihood function is not guaranteed for this model, the minimum of this function yields a good estimate for the model parameters, in particular if the noise level is treated as a free parameter. Furthermore, we show that a nonlinear function minimization method (r-algorithm with space dilation) usually reaches the global minimum.


Author(s):  
Jay D. Martin

A kriging model can be used as a surrogate to a more computationally expensive model or simulation. It is capable of providing a continuous mathematical relationship that can interpolate a set of observations. One of the major issues with using kriging models is the potentially computationally expensive process of estimating the best model parameters. One of the most common methods used to estimate model parameters is Maximum Likelihood Estimation (MLE). MLE of kriging model parameters requires the use of numerical optimization of a continuous but possibly multi-modal log-likelihood function. This paper presents some enhancements to gradient-based methods to make them more computationally efficient and compares the potential reduction in computational burden. These enhancements include the development of the analytic gradient and Hessian for the log-likelihood equation of a kriging model that uses a Gaussian spatial correlation function. The suggested algorithm is very similar to the Scoring algorithm traditionally used in statistics, a Newton-Raphson gradient-based optimization method.


2004 ◽  
Vol 16 (12) ◽  
pp. 2533-2561 ◽  
Author(s):  
Liam Paninski ◽  
Jonathan W. Pillow ◽  
Eero P. Simoncelli

We examine a cascade encoding model for neural response in which a linear filtering stage is followed by a noisy, leaky, integrate-and-fire spike generation mechanism. This model provides a biophysically more realistic alternative to models based on Poisson (memoryless) spike generation, and can effectively reproduce a variety of spiking behaviors seen in vivo. We describe the maximum likelihood estimator for the model parameters, given only extracellular spike train responses (not intracellular voltage data). Specifically, we prove that the log-likelihood function is concave and thus has an essentially unique global maximum that can be found using gradient ascent techniques. We develop an efficient algorithm for computing the maximum likelihood solution, demonstrate the effectiveness of the resulting estimator with numerical simulations, and discuss a method of testing the model's validity using time-rescaling and density evolution techniques.


2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 477-477
Author(s):  
Leah K Treffer ◽  
Edward S Rice ◽  
Anna M Fuller ◽  
Samuel Cutler ◽  
Jessica L Petersen

Abstract Domestic yak (Bos grunniens) are bovids native to the Asian Qinghai-Tibetan Plateau. Studies of Asian yak have revealed that introgression with domestic cattle has contributed to the evolution of the species. When imported to North America (NA), some hybridization with B. taurus did occur. The objective of this study was to use mitochondrial (mt) DNA sequence data to better understand the mtDNA origin of NA yak and their relationship to Asian yak and related species. The complete mtDNA sequence of 14 individuals (12 NA yak, 1 Tibetan yak, 1 Tibetan B. indicus) was generated and compared with sequences of similar species from GenBank (B. indicus, B. grunniens (Chinese), B. taurus, B. gaurus, B. primigenius, B. frontalis, Bison bison, and Ovis aries). Individuals were aligned to the B. grunniens reference genome (ARS_UNL_BGru_maternal_1.0), which was also included in the analyses. The mtDNA genes were annotated using the ARS-UCD1.2 cattle sequence as a reference. Ten unique NA yak haplotypes were identified, which a haplotype network separated into two clusters. Variation among the NA haplotypes included 93 nonsynonymous single nucleotide polymorphisms. A maximum likelihood tree including all taxa was made using IQtree after the data were partitioned into twenty-two subgroups using PartitionFinder2. Notably, six NA yak haplotypes formed a clade with B. indicus; the other four haplotypes grouped with B. grunniens and fell as a sister clade to bison, gaur and gayal. These data demonstrate two mitochondrial origins of NA yak with genetic variation in protein coding genes. Although these data suggest yak introgression with B. indicus, it appears to date prior to importation into NA. In addition to contributing to our understanding of the species history, these results suggest the two major mtDNA haplotypes in NA yak may functionally differ. Characterization of the impact of these differences on cellular function is currently underway.


Genetics ◽  
2000 ◽  
Vol 155 (3) ◽  
pp. 1429-1437
Author(s):  
Oliver G Pybus ◽  
Andrew Rambaut ◽  
Paul H Harvey

Abstract We describe a unified set of methods for the inference of demographic history using genealogies reconstructed from gene sequence data. We introduce the skyline plot, a graphical, nonparametric estimate of demographic history. We discuss both maximum-likelihood parameter estimation and demographic hypothesis testing. Simulations are carried out to investigate the statistical properties of maximum-likelihood estimates of demographic parameters. The simulations reveal that (i) the performance of exponential growth model estimates is determined by a simple function of the true parameter values and (ii) under some conditions, estimates from reconstructed trees perform as well as estimates from perfect trees. We apply our methods to HIV-1 sequence data and find strong evidence that subtypes A and B have different demographic histories. We also provide the first (albeit tentative) genetic evidence for a recent decrease in the growth rate of subtype B.
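The skyline plot mentioned in this abstract can be sketched in a few lines. The abstract does not give the estimator, so the code below rests on an assumption: the standard coalescent result that an interval with i lineages has expected duration equal to the scaled population size divided by i(i-1)/2, which makes duration times i(i-1)/2 a per-interval estimate. The function name and the input format (coalescent times measured backwards from tips all sampled at time 0) are hypothetical.

```python
def classic_skyline(coalescent_times):
    """Per-interval estimates of scaled population size from a genealogy.

    coalescent_times: increasing times, measured backwards from the present,
    of the coalescent events in a tree whose tips are all sampled at time 0.
    Returns a list of (interval_start, interval_end, estimate) tuples.
    """
    n_tips = len(coalescent_times) + 1
    estimates = []
    start = 0.0
    for k, end in enumerate(coalescent_times):
        lineages = n_tips - k                   # lineages present in this interval
        pairs = lineages * (lineages - 1) / 2   # number of lineage pairs
        estimates.append((start, end, (end - start) * pairs))
        start = end
    return estimates

# Hypothetical four-tip genealogy with coalescent events at these times.
for start, end, estimate in classic_skyline([0.1, 0.3, 0.6]):
    print(f"{start:.1f} to {end:.1f}: {estimate:.2f}")
```

Plotting the estimates as a step function over the intervals gives the nonparametric picture of demographic history; the estimates are in the same time units as the branch lengths of the tree.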


2013 ◽  
Vol 2013 ◽  
pp. 1-13 ◽  
Author(s):  
Helena Mouriño ◽  
Maria Isabel Barão

Missing-data problems are extremely common in practice. To achieve reliable inferential results, we need to take into account this feature of the data. Suppose that the univariate data set under analysis has missing observations. This paper examines the impact of selecting an auxiliary complete data set, whose underlying stochastic process is to some extent interdependent with that of the former, to improve the efficiency of the estimators for the relevant parameters of the model. The Vector AutoRegressive (VAR) Model has proved to be an extremely useful tool for capturing the dynamics of bivariate time series. We propose maximum likelihood estimators for the parameters of the VAR(1) Model based on a monotone missing-data pattern. The estimators’ precision is also derived. Afterwards, we compare the bivariate modelling scheme with its univariate counterpart. More precisely, the univariate data set with missing observations will be modelled by an AutoRegressive Moving Average (ARMA(2,1)) Model. We will also analyse the behaviour of the AutoRegressive Model of order one, AR(1), due to its practical importance. We focus on the mean value of the main stochastic process. By simulation studies, we conclude that the estimator based on the VAR(1) Model is preferable to those derived from the univariate context.


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 76-76
Author(s):  
Seyed Milad Vahedi ◽  
Karim Karimi ◽  
Siavash Salek Ardestani ◽  
Younes Miar

Abstract Aleutian disease (AD) is a chronic persistent infection in domestic mink caused by Aleutian mink disease virus (AMDV). Reduced female fertility and pelt quality are the main reasons for AD’s negative economic impact on the mink industry. A total of 79 American mink from the Canadian Center for Fur Animal Research at Dalhousie University (Truro, NS, Canada) were classified, based on the results of counter immunoelectrophoresis (CIEP) tests, into positive (n = 48) and negative (n = 31) groups. Whole-genome sequences comprising 4,176 scaffolds and 8,039,737 single nucleotide polymorphisms (SNPs) were used to trace the selection footprints for response to AMDV infection at the genome level. Window-based fixation index (Fst) and nucleotide diversity (θπ) statistics were estimated to compare positive and negative animals’ genomes. The top 1% genomic windows that overlapped between the two statistics were considered potential regions under selection pressure. A total of 98 genomic regions harboring 33 candidate genes were detected as selective signals. Most of the identified genes were involved in the development and functions of the immune system (PPP3CA, SMAP2, TNFRSF21, SKIL, and AKIRIN2), musculoskeletal system (COL9A2, PPP1R9A, ANK2, AKAP9, and STRIT1), nervous system (ASCL1, ZFP69B, SLC25A27, MCF2, and SLC7A14), reproductive system (CAMK2D, GJB7, SSMEM1, C6orf163), liver (PAH and DPYD), and lung (SLC35A1). Gene-expression network analysis showed the interactions among 27 identified genes. Moreover, pathway enrichment analysis of the constructed gene network revealed significant oxytocin (KEGG: hsa04921) and GnRH signaling (KEGG: hsa04912) pathways, which are likely to be impaired by AMDV, leading to reduced fecundity in dams. These results provide a perspective on the genetic architecture of response to AD in American mink and novel insight into the pathogenesis of AMDV.


2012 ◽  
Vol 2 (1) ◽  
pp. 7 ◽  
Author(s):  
Andrzej Kijko

This work is focused on the Bayesian procedure for the estimation of the regional maximum possible earthquake magnitude m_max. The paper briefly discusses the currently used Bayesian procedure for m_max, as developed by Cornell, and a statistically justifiable alternative approach is suggested. The fundamental problem in the application of the current Bayesian formalism for m_max estimation is that one of the components of the posterior distribution is the sample likelihood function, for which the range of observations (earthquake magnitudes) depends on the unknown parameter m_max. This dependence violates the property of regularity of the maximum likelihood function. The resulting likelihood function, therefore, reaches its maximum at the maximum observed earthquake magnitude m_max^obs and not at the required maximum possible magnitude m_max. Since the sample likelihood function is a key component of the posterior distribution, the posterior estimate m̂_max is biased. The degree of the bias and its sign depend on the applied Bayesian estimator, the quantity of information provided by the prior distribution, and the sample likelihood function. It has been shown that if the maximum posterior estimate is used, the bias is negative and the resulting underestimation of m_max can be as big as 0.5 units of magnitude. This study explores only the maximum posterior estimate of m_max, which is conceptually close to the classic maximum likelihood estimation. However, conclusions regarding the shortfall of the current Bayesian procedure are applicable to all Bayesian estimators, e.g. posterior mean and posterior median. A simple, ad hoc solution of this non-regular maximum likelihood problem is also presented.
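The non-regularity described in this abstract can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes magnitudes follow a doubly truncated exponential (Gutenberg-Richter type) distribution with known rate `beta` and known lower bound `m_min`, and it uses a made-up catalogue. Because the support of the density ends at m_max, the likelihood is zero for any candidate below the largest observation and decreases as the candidate grows beyond it.

```python
import math

def log_likelihood(m_max, magnitudes, beta, m_min):
    """Log-likelihood of m_max for a doubly truncated exponential magnitude
    distribution (an assumed model; beta and m_min are treated as known)."""
    if m_max < max(magnitudes):
        # An observation above m_max has zero density.
        return float("-inf")
    log_norm = math.log(1.0 - math.exp(-beta * (m_max - m_min)))
    return sum(math.log(beta) - beta * (m - m_min) - log_norm for m in magnitudes)

# Hypothetical catalogue and parameters, for illustration only.
magnitudes = [4.1, 4.3, 4.8, 5.2, 6.0]
beta, m_min = 2.0, 4.0

# Candidate values of m_max from 5.5 to 8.0 in steps of 0.1.
candidates = [k / 10 for k in range(55, 81)]
best_m_max = max(candidates, key=lambda c: log_likelihood(c, magnitudes, beta, m_min))
print("likelihood is maximised at m_max =", best_m_max)
print("largest observed magnitude =", max(magnitudes))
```

The maximiser coincides with the largest observed magnitude whatever the true upper bound is, which is why an estimator built directly on this likelihood tends to underestimate m_max.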

