Phylogenetic Estimation of Timescales Using Ancient DNA: The Effects of Temporal Sampling Scheme and Uncertainty in Sample Ages

2012 ◽  
Vol 30 (2) ◽  
pp. 253-262 ◽  
Author(s):  
Martyna Molak ◽  
Eline D. Lorenzen ◽  
Beth Shapiro ◽  
Simon Y.W. Ho

Abstract In recent years, ancient DNA has increasingly been used for estimating molecular timescales, particularly in studies of substitution rates and demographic histories. Molecular clocks can be calibrated using temporal information from ancient DNA sequences. This information comes from the ages of the ancient samples, which can be estimated by radiocarbon dating the source material or by dating the layers in which the material was deposited. Both methods involve sources of uncertainty. The performance of Bayesian phylogenetic inference depends on the information content of the data set, which includes variation in the DNA sequences and the structure of the sample ages. Various sources of estimation error can reduce our ability to estimate rates and timescales accurately and precisely. We investigated the impact of sample-dating uncertainties on the estimation of evolutionary timescale parameters using the software BEAST. Our analyses involved 11 published data sets and focused on estimates of substitution rate and root age. We show that, provided that samples have been accurately dated and have a broad temporal span, it might be unnecessary to account for sample-dating uncertainty in Bayesian phylogenetic analyses of ancient DNA. We also investigated the sample size and temporal span of the ancient DNA sequences needed to estimate phylogenetic timescales reliably. Our results show that the range of sample ages plays a crucial role in determining the quality of the results but that accurate and precise phylogenetic estimates of timescales can be made even with only a few ancient sequences. These findings have important practical consequences for studies of molecular rates, timescales, and population dynamics.
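
As a minimal sketch of why the temporal span of sample ages matters (a simple root-to-tip regression, not the paper's Bayesian BEAST analysis; the ages and divergences below are hypothetical), the slope of divergence against sampling time estimates the substitution rate, and its uncertainty shrinks as the age range widens:

```python
import numpy as np

ages = np.array([500.0, 2000.0, 8000.0, 15000.0, 30000.0])       # sample ages, years BP
divergence = np.array([0.0157, 0.0150, 0.0121, 0.0086, 0.0011])  # root-to-tip subs/site

# Older samples lie closer to the root, so divergence decreases with age BP;
# regressing divergence on calendar time (-age) gives the rate as the slope.
x = -ages
slope, intercept = np.polyfit(x, divergence, 1)
print(f"estimated substitution rate: {slope:.2e} subs/site/year")

# The slope's standard error shrinks as the spread of sample ages grows,
# which is why the temporal span of the ancient samples matters so much.
resid = divergence - (slope * x + intercept)
se = np.sqrt(resid @ resid / (len(x) - 2) / np.sum((x - x.mean()) ** 2))
print(f"standard error of the rate: {se:.2e}")
```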

2000 ◽  
Vol 31 (1) ◽  
pp. 71-90 ◽  
Author(s):  
Nils Møller Andersen ◽  
Jakob Damgaard ◽  
Felix A.H. Sperling

Abstract We examined phylogenetic relationships among gerrid water striders of the genus Aquarius Schellenberg using molecular and morphological characters. The molecular data sets included 780 bp of sequence data from the mitochondrial gene encoding cytochrome oxidase subunit I (COI) and 515 bp of sequence data from the nuclear gene encoding elongation factor-1 alpha (EF-1α). The morphological data set was a slightly modified version of a previously published data set. We included all 17 known species and one subspecies of Aquarius as well as five species from three related genera: Gigantometra gigas, Limnoporus esakii, L. rufoscutellatus, Gerris pingreensis, and G. lacustris. Unweighted parsimony analysis of the COI data set gave a single most parsimonious tree (MPT) with a topology quite similar to the morphological tree. Parsimony analysis of the EF-1α data set gave three MPTs, whose strict consensus had a slightly different topology. A combined analysis of the three data sets gave a single MPT with the same topology as that from the morphological data set alone. The phylogeny of Aquarius presented here supports the monophyly of the A. najas, remigis, conformis, and paludum species groups as well as previous hypotheses about their relationships. On the other hand, the inclusion of molecular data weakens the support for the monophyly of the genus Aquarius, and questions the specific status of the eastern North American A. nebularis (as separate from A. conformis) and of members of the Nearctic A. remigis group. Finally, we discuss the implications of the reconstructed phylogeny for the biogeography and ecological phylogenetics of Aquarius.
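
A minimal sketch of the small-parsimony count that such analyses minimize (Fitch's algorithm for one character on a fixed binary tree; the tree and nucleotide states are hypothetical, not the water strider data):

```python
def fitch(node, children, tip_states):
    """Post-order pass; returns (state set, change count) for `node`."""
    if node in tip_states:                      # leaf: observed state, no changes
        return {tip_states[node]}, 0
    left, right = children[node]
    lset, lcost = fitch(left, children, tip_states)
    rset, rcost = fitch(right, children, tip_states)
    inter = lset & rset
    if inter:                                   # children agree: no extra change
        return inter, lcost + rcost
    return lset | rset, lcost + rcost + 1       # conflict: count one change

# Hypothetical tree ((A,B),(C,D)) with nucleotides at one aligned site.
children = {"root": ("n1", "n2"), "n1": ("A", "B"), "n2": ("C", "D")}
tips = {"A": "T", "B": "T", "C": "G", "D": "T"}
_, changes = fitch("root", children, tips)
print(f"parsimony length of this character: {changes}")   # -> 1
```

Summing this count over all characters gives the tree length that parsimony search minimizes when looking for MPTs.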


1999 ◽  
Vol 354 (1384) ◽  
pp. 799-807 ◽  
Author(s):  
C. C. Lord ◽  
B. Barnard ◽  
K. Day ◽  
J. W. Hargrove ◽  
J. J. McNamara ◽  
...  

Recent research has shown that many parasite populations are made up of a number of epidemiologically distinct strains or genotypes. The implications of strain structure or genetic diversity for parasite population dynamics are still uncertain, partly because there is no coherent framework for the interpretation of field data. Here, we present an analysis of four published data sets for vector-borne microparasite infections where strains or genotypes have been distinguished: serotypes of African horse sickness (AHS) in zebra; types of Nannomonas trypanosomes in tsetse flies; parasite-induced erythrocyte surface antigen (PIESA) based isolates of Plasmodium falciparum malaria in humans; and merozoite surface protein 2 gene (MSP-2) alleles of P. falciparum in humans and in anopheline mosquitoes. For each data set we consider the distribution of strains or types among hosts and any pairwise associations between strains or types. Where host age data are available, we also compare age-prevalence relationships and estimates of the force of infection. Multiple infections of hosts are common, and for most data sets infections have an aggregated distribution among hosts, with a tendency towards positive associations between certain strains or types. These patterns could result from interactions (facilitation) between strains or types, or they could reflect patterns of contact between hosts and vectors. We use a mathematical model to illustrate the impact of host-vector contact patterns, finding that even if contact is random there may still be significant aggregation in parasite distributions. This effect is enhanced if there is non-random contact or other heterogeneities between hosts, vectors or parasites. In practice, different strains or types also have different forces of infection. We anticipate that aggregated distributions and positive associations between microparasite strains or types will be extremely common.
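
A hedged sketch of the kind of simulation that illustrates the heterogeneity effect (not the authors' model; the bite rates and gamma parameters are hypothetical): mixing Poisson contact with gamma-distributed host exposure inflates the variance-to-mean ratio of infections per host, the usual signature of aggregation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_hosts, mean_bites = 10_000, 2.0

# Homogeneous random contact: Poisson exposure, variance/mean close to 1.
homog = rng.poisson(mean_bites, n_hosts)

# Heterogeneous contact: gamma-distributed exposure rates mixed with a
# Poisson yield a negative binomial, variance/mean > 1 (aggregation).
shape = 0.5
rates = rng.gamma(shape=shape, scale=mean_bites / shape, size=n_hosts)
heterog = rng.poisson(rates)

for label, x in [("homogeneous", homog), ("heterogeneous", heterog)]:
    print(f"{label}: mean = {x.mean():.2f}, variance/mean = {x.var() / x.mean():.2f}")
```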


2021 ◽  
Author(s):  
Steven Marc Weisberg ◽  
Victor Roger Schinazi ◽  
Andrea Ferrario ◽  
Nora Newcombe

Relying on shared tasks and stimuli to conduct research can enhance the replicability of findings and allow a community of researchers to collect large data sets across multiple experiments. This approach is particularly relevant for experiments in spatial navigation, which often require the development of unfamiliar large-scale virtual environments to test participants. One challenge with shared platforms is that undetected technical errors, rather than being restricted to individual studies, become pervasive across many studies. Here, we discuss the discovery of a programming error (a bug) in a virtual environment platform used to investigate individual differences in spatial navigation: Virtual Silcton. The bug resulted in storing the absolute value of an angle in a pointing task rather than the signed angle. This bug was difficult to detect for several reasons, and it rendered the original sign of the angle unrecoverable. To assess the impact of the error on published findings, we collected a new data set for comparison. Our results revealed that the effect of the error on published data is likely to be minimal, which partially explains why the bug went undetected for years. We also used the new data set to develop a tool that allows researchers who have previously used Virtual Silcton to evaluate the impact of the bug on their findings. We summarize the ways in which shared open materials, shared data, and collaboration can pave the way for better science and prevent similar errors in the future.
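
A sketch of the class of bug described (hypothetical code, not Virtual Silcton's source): once abs() is applied to a signed pointing error, the left/right information is gone and cannot be recovered from the stored data:

```python
import math  # not needed for the arithmetic below; kept for angle utilities

def signed_pointing_error(target_bearing_deg, response_bearing_deg):
    """Signed angular difference in (-180, 180]; negative = counterclockwise."""
    diff = (response_bearing_deg - target_bearing_deg + 180.0) % 360.0 - 180.0
    return diff if diff != -180.0 else 180.0

err = signed_pointing_error(10.0, 350.0)  # participant points 20 degrees CCW of target
print(err)        # -20.0  (the value that should have been stored)
print(abs(err))   #  20.0  (the buggy stored value; the sign is unrecoverable)
```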


2020 ◽  
Author(s):  
Xiaoqian Jiang ◽  
Lishan Yu ◽  
Hamisu M. Salihu ◽  
Deepa Dongarwar

BACKGROUND In the United States, state laws require birth certificates to be completed for all births, and federal law mandates national collection and publication of births and other vital statistics data. The National Center for Health Statistics (NCHS) has published the key statistics of birth data over the years, and these data files, from as early as the 1970s, have been released and made publicly available. There are about 3 million new births each year, and every birth is a record in the data set, described by hundreds of variables. In total the data cover more than half of the current US population, making them an invaluable resource for studying birth epidemiology. Using such big data, researchers can ask interesting questions and study longitudinal patterns, for example, the impact of a mother's drinking status on infertility in metropolitan areas over the last decade, or the effect of the biological father's education level on c-section rates over the years. However, the existing published data sets cannot directly support such research questions because of adjustments to the variables and their categories over time, which leave the individually published data files fragmented. The information contained in the published data files is highly diverse, containing hundreds of variables each year. Besides minor adjustments such as renaming variables and adding categories, some major updates significantly changed the fields of statistics (including removal, addition, and modification of variables), making the published data disconnected and ambiguous to use across multiple years. Researchers have previously reconstructed features to study temporal patterns, but only at a limited scale (focusing on a few variables of interest). Many have reinvented the wheel, and such reconstructions lack consistency because different researchers may use different criteria to harmonize variables, leading to inconsistent findings and limiting the reproducibility of research. There has been no systematic effort to combine the roughly five decades of data files into a database that includes every variable ever released by NCHS. OBJECTIVE To utilize machine learning techniques to combine five decades of United States (US) natality data, with changing variables and factors, into a consistent database. METHODS We developed a feasible and efficient deep-learning-based framework to harmonize data sets of live births in the US from 1970 to 2018. We constructed a graph from the properties and elements of the databases, including their variables, and trained a graph convolutional network (GCN) on this graph to learn embeddings for the nodes, where the learned embeddings capture the similarity of variables. We devised a novel loss function with a slack margin and a banlist mechanism (for the random walk) to learn the desired structure, in which two nodes sharing more information are more similar to each other. We also developed an active learning mechanism to conduct the harmonization. RESULTS We harmonized the historical US birth data and resolved conflicts in ambiguous terms. From a total of 9,321 variables (i.e., 783 stemmed variables, from 1970 to 2018), applying our model iteratively together with human review, we obtained 323 hyperchains of variables. When considering pairs of different stemmed variables that changed over the years, the hyperchains for harmonization comprised 201 stemmed variable pairs. During the harmonization, the first round of our model proposed 305 candidate stemmed variable pairs (the top-20 most similar variables for each variable, according to the learned embeddings) and achieved recall and precision of 87.56% and 57.70%, respectively. CONCLUSIONS Our harmonized graph neural network (HGNN) method provides a feasible and efficient way to connect relevant databases at a meta-level. By adapting to a database's properties and characteristics, HGNN can learn patterns and search for relations globally, which makes it powerful for discovering similarity between variables across databases. Smart utilization of machine learning can significantly reduce the manual effort required to harmonize databases and integrate fragmented data into useful databases for future research.
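
A minimal sketch of the core idea (not the authors' HGNN implementation; the graph, features, and weights are hypothetical): one graph-convolution step produces node embeddings whose cosine similarities can rank candidate variable pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-variable graph: edges connect variables sharing metadata.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                               # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^-1/2 (A+I) D^-1/2

H = rng.normal(size=(4, 8))                          # initial node features
W = rng.normal(size=(8, 4))                          # layer weights (random here, learned in practice)
emb = np.maximum(A_norm @ H @ W, 0.0)                # one GCN layer with ReLU

# Rank the most similar variable for each node by cosine similarity,
# analogous to the top-k candidate pair generation described above.
norms = np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
sim = (emb / norms) @ (emb / norms).T
np.fill_diagonal(sim, -np.inf)                       # exclude self-matches
print("best match per variable:", sim.argmax(axis=1))
```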


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the best of the traditional classifiers we trained performs comparably to the deep learning methods on the first data set, but significantly worse on the second data set.
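
A hedged sketch of the traditional-classifier comparison described above, using scikit-learn; the English placeholder tweets stand in for the Arabic corpus, and the 26 pre-processing variants are omitted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus: 1 = health-related tweet, 0 = other.
texts = [
    "symptoms of seasonal flu and fever", "football match results tonight",
    "new vaccine clinic opens downtown", "traffic jam on the main highway",
    "tips to manage diabetes and diet", "concert tickets on sale tomorrow",
    "hospital warns of measles outbreak", "holiday travel deals this weekend",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

classifiers = {
    "LogReg": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "LinearSVM": LinearSVC(),
}

# Each pre-processing variant would be applied to `texts` before this loop;
# cross-validated F1 then ranks classifier/pre-processing combinations.
for name, clf in classifiers.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```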


Author(s):  
Adrien Oliva ◽  
Raymond Tobler ◽  
Alan Cooper ◽  
Bastien Llamas ◽  
Yassine Souilmi

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA 'reads') against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read-mapping software packages (BWA-aln, BWA-mem, NovoAlign, and Bowtie2), and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
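
A minimal sketch of the post-mapping quality filter mentioned above, using pysam; the file names and the MAPQ >= 30 cutoff are illustrative choices, not necessarily those used in the study:

```python
import pysam

with pysam.AlignmentFile("ancient_sample.bam", "rb") as infile, \
     pysam.AlignmentFile("ancient_sample.mapq30.bam", "wb", template=infile) as outfile:
    kept = total = 0
    for read in infile:
        total += 1
        # Drop unmapped reads and reads with low mapping quality, which are
        # the main carriers of reference bias in downstream analyses.
        if not read.is_unmapped and read.mapping_quality >= 30:
            outfile.write(read)
            kept += 1

print(f"kept {kept}/{total} reads")
```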


2011 ◽  
Vol 61 (2) ◽  
pp. 225-238 ◽  
Author(s):  
Wen Bo Liao ◽  
Zhi Ping Mi ◽  
Cai Quan Zhou ◽  
Ling Jin ◽  
Xian Han ◽  
...  

Abstract Comparative studies of relative testes size in animals show that promiscuous species have relatively larger testes than monogamous species: sperm competition favours the evolution of larger ejaculates in many animals, which requires bigger testes. In this context, we present data on relative testis mass for 17 Chinese frog species, including 3 polyandrous species. We analyzed relative testis mass within the Chinese data set and in combination with published data sets on Japanese and African frogs. We found that polyandrous foam-nesting species have relatively large testes, suggesting that sperm competition was an important factor in the evolution of relative testes size. For the 4 polyandrous species, testes mass is positively correlated with the intensity (males per mating) but not the risk (frequency of polyandrous matings) of sperm competition.


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state, with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities, including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases, and assumptions regarding the characteristics of the surface convective parcel, result in significant differences in the derived values of convective levels and indices in many soundings. In addition, we investigate the impact of including the humidity corrections and quality controls on the thermodynamic profiles used in the derivation of a large-scale model forcing data set. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.
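
A hedged illustration of why humidity bias corrections matter for derived convective levels (not the MC3E processing code): it uses the Lawrence (2005) approximation of roughly 125 m of lifting condensation level height per degree Celsius of dewpoint depression, and the 5% RH bias magnitude is hypothetical:

```python
import math

def dewpoint_c(temp_c, rh_percent):
    """Magnus-formula dewpoint from temperature (degC) and relative humidity (%)."""
    a, b = 17.625, 243.04
    gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)

def lcl_height_m(temp_c, rh_percent):
    """Lawrence (2005): LCL height ~ 125 m per degC of dewpoint depression."""
    return 125.0 * (temp_c - dewpoint_c(temp_c, rh_percent))

t_sfc = 28.0            # surface temperature, degC
rh_measured = 60.0      # RH reading with a hypothetical 5% dry bias
rh_corrected = 65.0     # RH after bias correction

print(f"LCL, uncorrected RH: {lcl_height_m(t_sfc, rh_measured):6.0f} m")
print(f"LCL, corrected RH:   {lcl_height_m(t_sfc, rh_corrected):6.0f} m")
# The ~160 m difference shows how a modest humidity bias shifts derived
# convective levels, and hence convective indices, in individual soundings.
```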


2021 ◽  
Author(s):  
David Cotton

Introduction

HYDROCOASTAL is a two-year project funded by ESA, with the objective to maximise exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2, and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating river discharge products.

New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented, and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets. A series of case studies will assess these products in terms of their scientific impacts. All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.

Objectives

The scientific objectives of HYDROCOASTAL are to enhance our understanding of the interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and the small-scale processes that govern these interactions. The project also aims to improve our capability to characterize the variation, at different time scales, of inland water storage, exchanges with the ocean, and the impact on regional sea-level changes. The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering, and retracking. An improved wet troposphere correction will also be developed and evaluated.

Project Outline

There are four tasks to the project:
- Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: new processing algorithms will be implemented to generate a Test Data Set, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: the impact of these global products will be assessed in a series of case studies.
- Outreach and Roadmap: outreach material will be prepared and distributed to engage the wider scientific community, and recommendations will be provided for the development of future missions and future research.

Presentation

The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms being evaluated in the first phase of the project, and show early results from the evaluation of the initial test data set.


2017 ◽  
Vol 3 (5) ◽  
pp. e192 ◽  
Author(s):  
Corina Anastasaki ◽  
Stephanie M. Morris ◽  
Feng Gao ◽  
David H. Gutmann

Objective: To ascertain the relationship between the germline NF1 gene mutation and glioma development in patients with neurofibromatosis type 1 (NF1). Methods: The relationship between the type and location of the germline NF1 mutation and the presence of a glioma was analyzed in 37 participants with NF1 from one institution (Washington University School of Medicine [WUSM]) with a clinical diagnosis of NF1. Odds ratios (ORs) were calculated using both unadjusted and weighted analyses of this data set in combination with 4 previously published data sets. Results: While no statistically significant association between the location and type of the NF1 mutation and glioma was observed in the WUSM cohort, power calculations revealed that a sample size of 307 participants would be required to determine the predictive value of the position or type of the NF1 gene mutation. Combining our data set with 4 previously published data sets (n = 310), children with glioma were found to be more likely to harbor 5′-end gene mutations (OR = 2; p = 0.006). Moreover, while not clinically predictive due to insufficient sensitivity and specificity, this association with glioma was stronger for participants with 5′-end truncating (OR = 2.32; p = 0.005) or 5′-end nonsense (OR = 3.93; p = 0.005) mutations relative to those without glioma. Conclusions: Individuals with NF1 and glioma are more likely to harbor nonsense mutations in the 5′ end of the NF1 gene, suggesting that the NF1 mutation may be one predictive factor for glioma in this at-risk population.
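
A sketch of the kind of 2 x 2 odds-ratio test behind such estimates (the contingency counts are hypothetical, chosen only to reproduce an OR of 2; SciPy's Fisher exact test returns the OR and p value):

```python
from scipy.stats import fisher_exact

#                     glioma   no glioma
table = [[18, 42],   # 5'-end NF1 mutation
         [18, 84]]   # mutation elsewhere

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.2f}, p = {p_value:.3f}")
# OR = (18 * 84) / (42 * 18) = 2.00 for these hypothetical counts.
```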

