Adaptive Estimation for Epidemic Renewal and Phylogenetic Skyline Models

Abstract Estimating temporal changes in a target population from phylogenetic or count data is an important problem in ecology and epidemiology. Reliable estimates can provide key insights into the climatic and biological drivers influencing the diversity or structure of that population and evidence hypotheses concerning its future growth or decline. In infectious disease applications, the individuals infected across an epidemic form the target population. The renewal model estimates the effective reproduction number, R, of the epidemic from counts of its observed cases. The skyline model infers the effective population size, N, underlying a phylogeny of sequences sampled from that epidemic. Practically, R measures ongoing epidemic growth while N informs on historical caseload. While both models solve distinct problems, the reliability of their estimates depends on p-dimensional piecewise-constant functions. If p is misspecified, the model might underfit significant changes or overfit noise and promote a spurious understanding of the epidemic, which might misguide intervention policies or misinform forecasts. Surprisingly, no transparent yet principled approach for optimising p exists. Usually, p is heuristically set, or obscurely controlled via complex algorithms. We present a computable and interpretable p-selection method based on the minimum description length (MDL) formalism of information theory. Unlike many standard model selection techniques, MDL accounts for the additional statistical complexity induced by how parameters interact. As a result, our method optimises p so that R and N estimates properly adapt to the available data. It also outperforms comparable Akaike and Bayesian information criteria on several classification problems. Our approach requires some knowledge of the parameter space and exposes the similarities between renewal and skyline models.


Adaptive Estimation for Epidemic Renewal and Phylogenetic Skyline Models

Systematic Biology ◽

10.1093/sysbio/syaa035 ◽

2020 ◽

Vol 69 (6) ◽

pp. 1163-1179 ◽

Cited By ~ 5

Author(s):

Kris V Parag ◽

Christl A Donnelly

Keyword(s):

Information Theory ◽

Model Selection ◽

Adaptive Estimation ◽

Minimum Description Length ◽

Target Population ◽

Theory Model ◽

Information Criteria ◽

Classification Problems ◽

Effective Population ◽

Incident Cases

Abstract Estimating temporal changes in a target population from phylogenetic or count data is an important problem in ecology and epidemiology. Reliable estimates can provide key insights into the climatic and biological drivers influencing the diversity or structure of that population and evidence hypotheses concerning its future growth or decline. In infectious disease applications, the individuals infected across an epidemic form the target population. The renewal model estimates the effective reproduction number, R, of the epidemic from counts of observed incident cases. The skyline model infers the effective population size, N, underlying a phylogeny of sequences sampled from that epidemic. Practically, R measures ongoing epidemic growth while N informs on historical caseload. While both models solve distinct problems, the reliability of their estimates depends on p-dimensional piecewise-constant functions. If p is misspecified, the model might underfit significant changes or overfit noise and promote a spurious understanding of the epidemic, which might misguide intervention policies or misinform forecasts. Surprisingly, no transparent yet principled approach for optimizing p exists. Usually, p is heuristically set, or obscurely controlled via complex algorithms. We present a computable and interpretable p-selection method based on the minimum description length (MDL) formalism of information theory. Unlike many standard model selection techniques, MDL accounts for the additional statistical complexity induced by how parameters interact. As a result, our method optimizes p so that R and N estimates properly and meaningfully adapt to available data. It also outperforms comparable Akaike and Bayesian information criteria on several classification problems, given minimal knowledge of the parameter space, and exposes statistical similarities among renewal, skyline, and other models in biology. 
Rigorous and interpretable model selection is necessary if trustworthy and justifiable conclusions are to be drawn from piecewise models. [Coalescent processes; epidemiology; information theory; model selection; phylodynamics; renewal models; skyline plots]
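
The effect of choosing p can be illustrated with a small sketch. The Python below is not the authors' implementation: it fits a piecewise-constant Poisson rate with p equal-width segments to a count series and scores each p with BIC and with a Fisher information approximation (FIA) to MDL. The equal-width segmentation, the function names, and the bounds `rate_min` and `rate_max` are assumptions made for illustration; the bounds stand in for the knowledge of the parameter space that the abstract mentions.

```python
import math


def poisson_loglik(counts, p):
    """Maximised Poisson log-likelihood with p equal-width constant segments.

    Assumes p divides len(counts) (an illustrative simplification).
    """
    width = len(counts) // p
    total = 0.0
    for j in range(p):
        seg = counts[j * width:(j + 1) * width]
        rate = sum(seg) / len(seg)  # maximum-likelihood rate for this segment
        if rate == 0:
            continue  # all counts are zero, so the segment contributes 0
        for k in seg:
            total += k * math.log(rate) - rate - math.lgamma(k + 1)
    return total


def bic(counts, p):
    """BIC in nats: penalises only the number of parameters."""
    return -poisson_loglik(counts, p) + 0.5 * p * math.log(len(counts))


def fia(counts, p, rate_min=0.1, rate_max=100.0):
    """Fisher information approximation to MDL (hypothetical rate bounds).

    With equal segments the per-observation Fisher information is
    diag(1 / (p * rate_j)), so the log-volume term has a closed form.
    """
    n = len(counts)
    side = 2.0 * (math.sqrt(rate_max) - math.sqrt(rate_min))
    log_volume = p * math.log(side) - 0.5 * p * math.log(p)
    return (-poisson_loglik(counts, p)
            + 0.5 * p * math.log(n / (2.0 * math.pi))
            + log_volume)


def select_p(counts, candidates, criterion):
    """Return the candidate p with the smallest criterion value."""
    return min(candidates, key=lambda p: criterion(counts, p))


counts = [5] * 20 + [50] * 20  # toy series with one step change
candidates = [1, 2, 4, 5, 8, 10]
best_bic = select_p(counts, candidates, bic)
best_fia = select_p(counts, candidates, fia)
```

On this toy series both criteria select p = 2. The two differ in general because the FIA volume term depends on the assumed parameter range and on how the parameters enter the Fisher information, whereas BIC counts parameters only.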


A demographic model to predict future growth of the Addo elephant population

Koedoe ◽

10.4102/koedoe.v42i1.226 ◽

1999 ◽

Vol 42 (1) ◽

Cited By ~ 10

Author(s):

A.M. Woodd

Keyword(s):

Population Size ◽

National Park ◽

Target Population ◽

Demographic Model ◽

Elephant Population ◽

Age Structured ◽

Term Data ◽

Future Population ◽

Future Growth

An age-structured demographic model of the growth of the Addo elephant population was developed using parameters calculated from long-term data on the population. The model was used to provide estimates of future population growth. Expansion of the Addo Elephant National Park is currently underway, and the proposed target population size for elephant within the enlarged park is 2700. The model predicts that this population size will be reached during the year 2043, so that the Addo elephant population can continue to increase for a further 44 years before its target size within the enlarged park is attained.
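
The abstract does not report the model's parameters, so the following is only a generic sketch of how an age-structured (Leslie matrix) projection can be run forward until a target size is reached. The function names and any fecundity or survival values passed in are hypothetical placeholders, not estimates for the Addo population.

```python
def project_one_year(population, fecundity, survival):
    """Advance an age-structured population by one annual step.

    population: individuals per age class, youngest first.
    fecundity:  offspring per individual in each age class.
    survival:   probability of moving from class i to class i + 1
                (one entry fewer than population; the oldest class dies out).
    """
    births = sum(f * n for f, n in zip(fecundity, population))
    aged = [s * n for s, n in zip(survival, population[:-1])]
    return [births] + aged


def years_to_target(population, fecundity, survival, target, max_years=1000):
    """Count annual steps until the total population reaches the target.

    Returns None if the target is not reached within max_years.
    """
    years = 0
    while sum(population) < target:
        if years >= max_years:
            return None
        population = project_one_year(population, fecundity, survival)
        years += 1
    return years
```

In the abstract's terms, a result of 44 steps from a 1999 starting point would correspond to the reported year 2043.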


On the minimum description length principle for sources with piecewise constant parameters

IEEE Transactions on Information Theory ◽

10.1109/18.265504 ◽

1993 ◽

Vol 39 (6) ◽

pp. 1962-1967 ◽

Cited By ~ 35

Author(s):

N. Merhav

Keyword(s):

Minimum Description Length ◽

Minimum Description Length Principle ◽

Piecewise Constant


Phylogenomics and phylodynamics of SARS-CoV-2 retrieved genomes from India

10.1101/2020.06.23.20138222 ◽

2020 ◽

Author(s):

Sameera Farah ◽

Ashwin Atkulwar ◽

Manas Ranjan Praharaj ◽

Raja Khan ◽

Ravikumar Gandham ◽

...

Keyword(s):

Genome Sequencing ◽

Vaccine Development ◽

Reproduction Number ◽

Bayesian Skyline Plot ◽

Development Programs ◽

Effective Population ◽

Whole Genomes ◽

Indian Isolates ◽

Spanish Flu ◽

Community Transmission

Abstract The ongoing SARS-CoV-2 pandemic is one of the biggest outbreaks since the Spanish flu of 1918. Understanding the epidemiology of viral outbreaks is the first step towards vaccine development programs. This is the first phylodynamics study attempted on SARS-CoV-2 genomes from India to infer the virus's current evolution in the context of an ongoing pandemic. Out of 286 retrieved SARS-CoV-2 whole genomes from India, 138 haplotypes were generated and analyzed. A median-joining network was built to investigate the relatedness of SARS-CoV-2 haplotypes in India. The BDSIR package of BEAST2 was used to calculate the reproduction number (R0) and the infectious rate of the virus. Past and current population trends were investigated using the stamp date method in a coalescent Bayesian skyline plot, implemented in BEAST2, and by an exponential growth prior in BEAST 1.10.4. The median-joining network reveals two distinct ancestral clusters, A and B, showing genetic affinities with the Wuhan outbreak sample. The network also illustrates the autochthonous development of isolates in a few instances. The high basic reproduction number of SARS-CoV-2 in India points towards a phase of active community transmission. The Bayesian skyline plot reveals an exponential rise in the effective population size (Ne) of Indian isolates from the first week of January to the first week of April 2020. More genome sequencing and analyses of the virus will be required in the coming days to monitor COVID-19 after the lifting of the lockdown in India.


Long-tailed distributions of inter-event times as mixtures of exponential distributions

Royal Society Open Science ◽

10.1098/rsos.191643 ◽

2020 ◽

Vol 7 (2) ◽

pp. 191643 ◽

Cited By ~ 2

Author(s):

Makoto Okada ◽

Kenji Yamanishi ◽

Naoki Masuda

Keyword(s):

Latent Variables ◽

State Of The Art ◽

Minimum Description Length ◽

Information Criteria ◽

Human Behaviour ◽

Exponential Distributions ◽

Likelihood Estimator ◽

Event Times ◽

Parameter Values ◽

Power Law Distributions

Inter-event times of various human behaviours are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states, in each of which they are regarded to generate events obeying a Poisson process. If this is the case, inter-event times should approximately obey a mixture of exponential distributions with different parameter values. In the present study, we introduce the minimum description length principle to compare mixtures of exponential distributions with different numbers of components (i.e. constituent exponential distributions). Because these distributions violate the identifiability property, one is mathematically not allowed to apply the Akaike or Bayes information criteria to their maximum-likelihood estimator to carry out model selection. We overcome this theoretical barrier by applying a minimum description length principle to joint likelihoods of the data and latent variables. We show that mixtures of exponential distributions with a few components are selected, as opposed to more complex mixtures in various datasets, and that the fitting accuracy is comparable to that of state-of-the-art algorithms to fit power-law distributions to data. Our results lend support to Poissonian explanations of apparently non-Poissonian human behaviour.
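
Why a mixture of exponentials can look long-tailed can be checked with a short calculation. The sketch below is illustrative only (the function names and example parameters are assumptions, not taken from the paper). It evaluates the survival function of a mixture and its coefficient of variation, which equals 1 for a single exponential distribution and exceeds 1 whenever components with positive weight have different rates.

```python
import math


def mixture_survival(t, weights, rates):
    """P(T > t) for a mixture of exponentials with the given weights and rates."""
    return sum(w * math.exp(-r * t) for w, r in zip(weights, rates))


def mixture_cv(weights, rates):
    """Coefficient of variation (standard deviation / mean) of the mixture."""
    mean = sum(w / r for w, r in zip(weights, rates))
    second_moment = sum(2.0 * w / r ** 2 for w, r in zip(weights, rates))
    return math.sqrt(second_moment - mean ** 2) / mean


# Hypothetical two-state example: a slow state and a fast state, equally likely.
weights, rates = [0.5, 0.5], [0.1, 10.0]
mean_gap = sum(w / r for w, r in zip(weights, rates))

cv_single = mixture_cv([1.0], [2.0])
cv_mixture = mixture_cv(weights, rates)
tail_mixture = mixture_survival(50.0, weights, rates)
tail_exponential = math.exp(-50.0 / mean_gap)  # exponential with the same mean
```

Here the mixture places far more probability on long gaps than an exponential distribution with the same mean, which is the qualitative signature the abstract attributes to switching between Poissonian states.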


Clustgrams: an extension to histogram densities based on the minimum description length principle

Open Computer Science ◽

10.2478/s13537-011-0033-x ◽

2011 ◽

Vol 1 (4) ◽

Cited By ~ 1

Author(s):

Panu Luosto ◽

Petri Kontkanen

Keyword(s):

Machine Learning ◽

Density Estimation ◽

Minimum Description Length ◽

Complex Problem ◽

Density Functions ◽

Large Sample Size ◽

Minimum Description Length Principle ◽

Piecewise Constant ◽

Problem Instances ◽

Model Selection Problem

Abstract Density estimation is one of the most important problems in statistical inference and machine learning. A common approach to the problem is to use histograms, i.e., piecewise constant densities. Histograms are flexible and can adapt to any density given enough bins. However, due to the simplicity of histograms, a large number of parameters and a large sample size might be needed for learning an accurate density, especially in more complex problem instances. In this paper, we extend the histogram density estimation framework by introducing a model called clustgram, which uses arbitrary density functions as components of the density rather than just uniform components. The new model is based on finding a clustering of the sample points and determining the type of the density function for each cluster. We regard the problem of learning clustgrams as a model selection problem and use the theoretically appealing minimum description length principle for solving the task.


SpectralTDF: transition densities of diffusion processes with time-varying selection parameters, mutation rates, and effective population sizes

10.1101/029736 ◽

2015 ◽

Cited By ~ 1

Author(s):

Matthias Steinrücken ◽

Ethan M Jewett ◽

Yun S Song

Keyword(s):

Spectral Representation ◽

Diffusion Processes ◽

Transition Density ◽

Effective Population ◽

Piecewise Constant ◽

Practical Applications ◽

Transition Density Function ◽

Transition Densities ◽

Population Sizes ◽

Selection Parameters

In the Wright-Fisher diffusion, the transition density function (TDF) describes the time-evolution of the population-wide frequency of an allele. This function has several practical applications in population genetics, and computing it for biologically realistic scenarios with selection and demography is an important problem. We develop an efficient method for finding a spectral representation of the TDF for a general model where the effective population size, selection coefficients, and mutation parameters vary over time in a piecewise constant manner. The method, called SpectralTDF, is available at https://sourceforge.net/projects/spectraltdf/.


Prioritization of the Target Population for Coronavirus Disease 2019 (COVID-19) Vaccination Program in Thailand

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph182010803 ◽

2021 ◽

Vol 18 (20) ◽

pp. 10803

Author(s):

Rapeepong Suphanchaimat ◽

Titiporn Tuangratananon ◽

Nattadhanai Rajatanavin ◽

Mathudara Phaiyarom ◽

Warisara Jaruwanno ◽

...

Keyword(s):

Incremental Cost ◽

Reproduction Number ◽

Vaccination Strategy ◽

Cost Effective ◽

Target Population ◽

Social Acceptability ◽

Deterministic System ◽

Vaccination Policy ◽

Policy Scenario ◽

Second Wave

Thailand was hit by the second wave of Coronavirus Disease 2019 (COVID-19) in a densely migrant-populated province (Samut Sakhon). COVID-19 vaccines were known to be effective; however, the supply was limited. Therefore, this study aimed to predict the effectiveness of Thailand’s COVID-19 vaccination strategy. We obtained most of the data from the Ministry of Public Health. Deterministic system dynamics and compartmental models were utilized. The reproduction number (R) among Thais and among migrants was estimated at 1.25 and 2.5, respectively. Vaccine effectiveness (VE) to prevent infection was assumed at 50%. In Samut Sakhon, there were 500,000 resident Thais and 360,000 resident migrants. The contribution of migrants to the province’s gross domestic product was estimated at 20%. Different policy scenarios were analyzed. The migrant-centric vaccination policy scenario received the lowest incremental cost per case or death averted compared with the other scenarios. The Thai-centric policy scenario yielded an incremental cost of 27,191 Baht per life saved, while the migrant-centric policy scenario produced a corresponding incremental cost of 3782 Baht. Sensitivity analysis also demonstrated that the migrant-centric scenario presented the most cost-effective outcome even when VE diminished to 20%. A migrant-centric policy yielded the smallest volume of cumulative infections and deaths and was the most cost-effective scenario, independent of R and VE values. Further studies should address political feasibility and social acceptability of migrant vaccine prioritization.


Word2vec Skip-Gram Dimensionality Selection via Sequential Normalized Maximum Likelihood

Entropy ◽

10.3390/e23080997 ◽

2021 ◽

Vol 23 (8) ◽

pp. 997

Author(s):

Pham Thuc Hung ◽

Kenji Yamanishi

Keyword(s):

Probability Theory ◽

Maximum Likelihood ◽

Bayesian Information Criterion ◽

Minimum Description Length ◽

Information Criterion ◽

Information Criteria ◽

Efficient Computation ◽

Data Sequence ◽

Word Similarity ◽

Normalized Maximum Likelihood

In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is considered as an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply information criteria with the aim of selecting the best dimensionality so that the corresponding model can be as close as possible to the true distribution. We examine the following information criteria for the dimensionality selection problem: Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sequential Normalized Maximum Likelihood (SNML) criterion. SNML is the total codelength required for the sequential encoding of a data sequence on the basis of the minimum description length. The proposed approach is applied to both the original SG model and the SG Negative Sampling model to clarify the idea of using information criteria. Additionally, as the original SNML suffers from computational disadvantages, we introduce novel heuristics for its efficient computation. Moreover, we empirically demonstrate that SNML outperforms both BIC and AIC. In comparison with other evaluation methods for word embedding, the dimensionality selected by SNML is significantly closer to the optimal dimensionality obtained by word analogy or word similarity tasks.


Phylogenomics and phylodynamics of SARS-CoV-2 genomes retrieved from India

Future Virology ◽

10.2217/fvl-2020-0243 ◽

2020 ◽

Vol 15 (11) ◽

pp. 747-753

Author(s):

Sameera Farah ◽

Ashwin Atkulwar ◽

Manas Ranjan Praharaj ◽

Raja Khan ◽

Ravikumar Gandham ◽

...

Keyword(s):

Population Dynamics ◽

Severe Acute Respiratory Syndrome ◽

Basic Reproduction Number ◽

Reproduction Number ◽

Phylogenetic Network ◽

Effective Population ◽

Current State ◽

Whole Genomes ◽

Indian Isolates ◽

Community Transmission

Background: This is the first phylodynamic study attempted on SARS-CoV-2 genomes from India to infer the current state of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) evolution using phylogenetic network and growth trends. Materials & Methods: Out of 286 retrieved whole genomes from India, 138 haplotypes were used to build a phylogenetic network. The birth–death serial model (BDSIR) package of BEAST2 was used to calculate the reproduction number of SARS-CoV-2. Population dynamics were investigated using the stamp date method as implemented in BEAST2 and BEAST 1.10.4. Results: A median-joining network revealed two ancestral clusters. A high basic reproduction number of SARS-CoV-2 was found. An exponential rise in the effective population size of Indian isolates was detected. Conclusion: The phylogenetic network reveals dual ancestry and possibility of community transmission of SARS-CoV-2 in India.
