Benchmarking metagenomics classifiers on ancient viral DNA: a simulation study

2021 ◽  
Author(s):  
Yami Ommar Arizmendi Cárdenas ◽  
Samuel Neuenschwander ◽  
Anna-Sapfo Malaspinas

Owing to technological advances in ancient DNA, it is now possible to sequence viruses from the past to track down their origin and evolution. However, ancient DNA data are considerably more degraded and contaminated than modern data, making the identification of ancient viral genomes particularly challenging. Several methods to characterise the modern microbiome (and, within this, the virome) have been developed. Many of them assign sequenced reads to specific taxa to characterise the organisms present in a sample of interest. While these existing tools are routinely used on modern data, their performance when applied to ancient virome data remains unknown. In this work, we conduct an extensive simulation study using public viral sequences to establish which tool is the most suitable for ancient virome studies. We compare the performance of four widely used classifiers, namely Centrifuge, Kraken2, DIAMOND and MetaPhlAn2, in correctly assigning sequencing reads to the corresponding viruses. To do so, we simulate reads by adding noise typical of ancient DNA to a randomly chosen set of publicly available viral sequences and to the human genome. We fragment the DNA into different lengths, and add sequencing error and C-to-T and G-to-A deamination substitutions at the read termini. We then measure the resulting precision and sensitivity for all classifiers. Across most simulations, 119 out of the 120 simulated viruses are recovered by Centrifuge, Kraken2 and DIAMOND, in contrast to MetaPhlAn2, which recovers only around one third. While deamination damage has little impact on the performance of the classifiers, DIAMOND and Kraken2 cannot classify very short reads. For data with longer fragments, if precision is strongly favoured over sensitivity, DIAMOND performs best. However, since Centrifuge can handle short reads and since it achieves the highest sensitivity and precision at the species level, it is our recommended tool overall.
Regardless of the tool used, our simulations indicate that, for ancient human studies, users should use strict filters to remove all reads of potential human origin. Finally, if the goal is to detect a specific virus, given the high variability observed among tested viral sequences, a simulation study to determine if a given tool can recover the virus of interest should be conducted prior to analysing real data.
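The damage model described above (fragmentation, uniform sequencing error, C-to-T and G-to-A deamination concentrated at the read termini) can be sketched in a few lines. The function name, default rates and overhang length below are illustrative assumptions, not the authors' actual simulation pipeline:

```python
import random

def simulate_ancient_read(sequence, read_length, error_rate=0.001,
                          deamination_rate=0.3, overhang=5, rng=None):
    """Simulate one ancient-DNA-like read from a reference sequence:
    fragment, apply terminal deamination, then uniform sequencing error."""
    rng = rng or random.Random()
    start = rng.randrange(0, len(sequence) - read_length + 1)
    read = list(sequence[start:start + read_length])

    # Deamination damage: C->T near the 5' terminus, G->A near the 3' terminus.
    for i in range(min(overhang, read_length)):
        if read[i] == "C" and rng.random() < deamination_rate:
            read[i] = "T"
        j = read_length - 1 - i
        if read[j] == "G" and rng.random() < deamination_rate:
            read[j] = "A"

    # Uniform sequencing error on every base.
    for i, base in enumerate(read):
        if rng.random() < error_rate:
            read[i] = rng.choice([b for b in "ACGT" if b != base])
    return "".join(read)
```

A real study would additionally draw fragment lengths from an empirical distribution and let the deamination probability decay with distance from the terminus.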

2020 ◽  
Author(s):  
Chen Cao ◽  
Matthew Greenberg ◽  
Quan Long

Abstract Many tools can reconstruct viral sequences based on next generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0 + L1-regularized regression synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours. Source code and binaries are freely available at https://github.com/theLongLab/wglink.
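The abstract does not give pseudocode for WgLink's L0 + L1-regularized regression, but the L1 part of such an objective is conventionally handled with the soft-thresholding proximal operator. The sketch below shows plain ISTA (proximal gradient descent) for the lasso sub-problem only, as an illustration of the regularization idea rather than WgLink's actual algorithm:

```python
def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_ista(X, y, lam, step=0.01, iters=2000):
    """Minimise 0.5*||y - X b||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        # Residual r = X b - y, then gradient of the smooth part: X^T r.
        r = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        # Gradient step followed by the L1 proximal step.
        b = [soft_threshold(b[j] - step * g[j], step * lam) for j in range(p)]
    return b
```

An L0 term additionally caps the number of nonzero coefficients, which is typically handled by hard thresholding or combinatorial search rather than a convex proximal step.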


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Abstract Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
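For reference, the standard (non-robust) Yeo–Johnson transform that the abstract builds on has a simple closed form. This stdlib sketch implements the textbook definition for a scalar; it is not the robust modification the authors propose, and the transformation parameter would normally be fitted by maximum likelihood (e.g. scipy.stats.yeojohnson):

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a scalar x for parameter lmbda.

    lmbda = 1 leaves the data unchanged; lmbda < 1 pulls in the right
    tail; lmbda > 1 pulls in the left tail. Defined for all real x.
    """
    if x >= 0:
        if lmbda != 0:
            return ((x + 1.0) ** lmbda - 1.0) / lmbda
        return math.log(x + 1.0)
    if lmbda != 2:
        return -(((-x + 1.0) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda))
    return -math.log(-x + 1.0)
```

The paper's point is that when lmbda is estimated by ordinary maximum likelihood, a few outliers can drag the estimate away from the value that normalizes the bulk of the data.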


Author(s):  
Manish C Choudhary ◽  
Charles R Crain ◽  
Xueting Qiu ◽  
William Hanage ◽  
Jonathan Z Li

Abstract Background Both SARS-CoV-2 reinfection and persistent infection have been reported, but sequence characteristics in these scenarios have not been described. We assessed published cases of SARS-CoV-2 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection. Methods A systematic review of PubMed was conducted to identify cases of SARS-CoV-2 reinfection and persistence with available sequences. Nucleotide and amino acid changes in the reinfecting sequence were compared to both the initial and contemporaneous community variants. Time-measured phylogenetic reconstruction was performed to compare intra-host viral evolution in persistent SARS-CoV-2 to community-driven evolution. Results Twenty reinfection and nine persistent infection cases were identified. Reports of reinfection cases spanned a broad distribution of ages, baseline health status and reinfection severity, with reinfection occurring as early as 1.5 months or as late as more than 8 months after the initial infection. The reinfecting viral sequences had a median of 17.5 nucleotide changes, with enrichment in the ORF8 and N genes. The number of changes did not differ by the severity of reinfection, and reinfecting variants were similar to the contemporaneous sequences circulating in the community. Patients with persistent COVID-19 demonstrated more rapid accumulation of sequence changes than seen with community-driven evolution, with continued evolution during convalescent plasma or monoclonal antibody treatment. Conclusions Reinfecting SARS-CoV-2 viral genomes largely mirror contemporaneous circulating sequences in the same geographic region, while persistent COVID-19 has been largely described in immunosuppressed individuals and is associated with accelerated viral evolution.


1980 ◽  
Vol 210 (1180) ◽  
pp. 423-435 ◽  

We have cloned and propagated in prokaryotic vectors the viral DNA sequences that are integrated in a variety of cells transformed by adenovirus 2 or SV40. Analysis of the clones reveals that the viral DNA sequences sometimes are arranged in a simple fashion, collinear with the viral genome; in other cell lines there are complex arrangements of viral sequences in which tracts of the viral genome are inverted with respect to each other. In several cases the nucleotide sequences at the joints between cell and viral sequences have been determined: usually there is a sharp transition between cellular and viral DNAs. The viral sequences are integrated at different locations within the genomes of different cell lines; likewise there is no specific site on the viral genomes at which integration occurs. Sometimes the viral sequences are integrated within repetitive cellular DNA, and sometimes within unique sequences. In some cases there is evidence that the viral sequences along with the flanking cell DNA have been amplified after integration. The sequences that flank the viral insertion in the line of SV40-transformed rat cells known as 14B have been used as probes to isolate, from untransformed rat cells, clones that carry the region of the chromosome in which integration occurred. Analysis of the structure of these clones by restriction endonuclease digestion and heteroduplex formation shows that a rearrangement of cellular sequences has occurred, presumably as a consequence of integration.


2006 ◽  
Vol 87 (10) ◽  
pp. 3045-3051 ◽  
Author(s):  
Mazen S. Habayeb ◽  
Sophia K. Ekengren ◽  
Dan Hultmark

Several viruses, including picornaviruses, are known to establish persistent infections, but the mechanisms involved are poorly understood. Here, a novel picorna-like virus, Nora virus, which causes a persistent infection in Drosophila melanogaster, is described. It has a single-stranded, positive-sense genomic RNA of 11879 nt, followed by a poly(A) tail. Unlike other picorna-like viruses, the genome has four open reading frames (ORFs). One ORF encodes a picornavirus-like cassette of proteins for virus replication, including an iflavirus-like RNA-dependent RNA polymerase and a helicase that is related to those of mammalian picornaviruses. The three other ORFs are not closely related to any previously described viral sequences. The unusual sequence and genome organization in Nora virus suggest that it belongs to a new family of picorna-like viruses. Surprisingly, Nora virus could be detected in all tested D. melanogaster laboratory stocks, as well as in wild-caught material. The viral titres varied enormously, between 10^4 and 10^10 viral genomes per fly in different stocks, without causing obvious pathological effects. The virus was also found in Drosophila simulans, a close relative of D. melanogaster, but not in more distantly related Drosophila species. It will now be possible to use Drosophila genetics to study the factors that control this persistent infection.


2021 ◽  
Vol 9 (1) ◽  
pp. 190-210
Author(s):  
Arvid Sjölander ◽  
Ola Hössjer

Abstract Unmeasured confounding is an important threat to the validity of observational studies. A common way to deal with unmeasured confounding is to compute bounds for the causal effect of interest, that is, a range of values that is guaranteed to include the true effect, given the observed data. Recently, bounds have been proposed that are based on sensitivity parameters, which quantify the degree of unmeasured confounding on the risk ratio scale. These bounds can be used to compute an E-value, that is, the degree of confounding required to explain away an observed association, on the risk ratio scale. We complement and extend this previous work by deriving analogous bounds, based on sensitivity parameters on the risk difference scale. We show that our bounds can also be used to compute an E-value, on the risk difference scale. We compare our novel bounds with previous bounds through a real data example and a simulation study.
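On the risk ratio scale, the E-value mentioned above has a well-known closed form, E = RR + sqrt(RR * (RR - 1)). A minimal sketch of that standard formula follows; the paper's own contribution, analogous bounds and E-values on the risk difference scale, is not reproduced here:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (VanderWeele & Ding formula):
    the minimum strength of confounding, on the risk ratio scale, that
    both confounder-exposure and confounder-outcome associations must
    have to fully explain away the observed association."""
    if rr < 1:
        rr = 1.0 / rr  # protective associations are inverted first
    return rr + math.sqrt(rr * (rr - 1.0))
```

For example, an observed risk ratio of 2 yields an E-value of about 3.41, meaning unmeasured confounding would need associations of that magnitude with both exposure and outcome to reduce the observed association to the null.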


2017 ◽  
Vol 40 (2) ◽  
pp. 205-221 ◽  
Author(s):  
Shahryar Mirzaei ◽  
Gholam Reza Mohtashami Borzadaran ◽  
Mohammad Amini

In this paper, we consider two well-known methods for analysis of the Gini index, namely U-statistics and linearization, for some income distributions. In addition, we evaluate the two methods with respect to some properties of their proposed estimators. We also compare the two methods with resampling techniques in approximating some properties of the Gini index. A simulation study shows that the linearization method performs well compared to the Gini estimator based on U-statistics. A brief study on real data supports our findings.
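The U-statistic form of the Gini estimator referred to above can be written directly as a normalized sum of pairwise absolute differences. A minimal stdlib sketch (the linearization estimator and the paper's variance comparisons are not reproduced):

```python
from itertools import combinations

def gini_u_statistic(x):
    """Gini index estimated via the U-statistic form: the mean absolute
    pairwise difference (the unbiased kernel average over all pairs),
    divided by twice the sample mean."""
    n = len(x)
    mean = sum(x) / n
    pairwise = sum(abs(a - b) for a, b in combinations(x, 2))
    # Equivalent to (2/(n(n-1)) * pairwise) / (2 * mean).
    return pairwise / (n * (n - 1) * mean)
```

The O(n^2) pairwise sum is the honest U-statistic; for large samples the same quantity is usually computed in O(n log n) after sorting.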


2021 ◽  
Vol 28 (1) ◽  
Author(s):  
Diego Michael Cornelius dos Santos ◽  
Bruna Karine dos Santos ◽  
César Gabriel dos Santos

Abstract: Due to technological advances, trade policies and society's consumption patterns, competitiveness among companies has increased considerably, requiring practices that provide constant improvement in production indicators and product quality. In this context, the use of Toyota Production System tools, also known as Lean Manufacturing, has a fundamental role in the elimination of waste and the continuous improvement of industrial production levels. Thus, this work aims to implement a standardized work routine among employees working in the parts market of an agricultural machinery industry, which lacked defined production methods. To represent this situation, real data corresponding to the needs of the assembly line were used, and served as the basis for the analysis and implementation of a new work routine. The results obtained enabled the creation of a standardized work routine, achieved by balancing activities between operators and eliminating activities that did not add value to the product.


2019 ◽  
Author(s):  
Leili Tapak ◽  
Omid Hamidi ◽  
Majid Sadeghifar ◽  
Hassan Doosti ◽  
Ghobad Moradi

Abstract Objectives Zero-inflated proportion or rate data nested in clusters due to the sampling structure can be found in many disciplines. Sometimes the rate response may not be observed for some study units because of limitations (false negatives), such as failures in recording data, and zeros are observed instead of the actual value of the rate/proportion (low incidence). In this study, we propose a multilevel zero-inflated censored Beta regression model that can address zero-inflated rate data with low incidence. Methods We assumed that the random effects are independent and normally distributed. The performance of the proposed approach was evaluated by application to a three-level real data set and by a simulation study. We applied the proposed model to analyze brucellosis diagnosis rate data and to investigate the effects of climate and geographical position. For comparison, we also applied the standard zero-inflated censored Beta regression model, which does not account for correlation. Results The proposed model performed better than the standard zero-inflated censored Beta model based on the AIC criterion. Height (p-value <0.0001), temperature (p-value <0.0001) and precipitation (p-value = 0.0006) significantly affected brucellosis rates, whereas precipitation in the standard ZICBETA model was not statistically significant (p-value = 0.385). The simulation study also showed that the estimates obtained by the maximum likelihood approach were reasonable in terms of mean square error. Conclusions The results showed that the proposed method can capture the correlations in the real data set and yields accurate parameter estimates.
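The zero-inflated Beta likelihood underlying such models has a simple mixture form: a point mass pi0 at zero and a Beta density on (0, 1) otherwise. A stdlib sketch using the common mean–precision parameterisation of the Beta part (an assumption; the paper's multilevel random effects and censoring components are omitted):

```python
import math

def beta_pdf(y, a, b):
    """Density of Beta(a, b) at y in (0, 1), via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y)
                    + (b - 1.0) * math.log(1.0 - y))

def zi_beta_density(y, pi0, mu, phi):
    """Zero-inflated Beta: point mass pi0 at zero, otherwise
    (1 - pi0) times a Beta(mu*phi, (1-mu)*phi) density, where mu is
    the conditional mean and phi the precision."""
    if y == 0:
        return pi0
    return (1.0 - pi0) * beta_pdf(y, mu * phi, (1.0 - mu) * phi)
```

In a regression setting, mu and pi0 would each be linked to covariates (e.g. logit links), and the multilevel version adds cluster-specific normal random effects to those linear predictors.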


Mathematics ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. 1786 ◽  
Author(s):  
A. M. Abd El-Raheem ◽  
M. H. Abu-Moussa ◽  
Marwa M. Mohie El-Din ◽  
E. H. Hafez

In this article, a progressive-stress accelerated life test (ALT) based on progressive type-II censoring is studied. The cumulative exposure model is used when the lifetime of test units follows the Pareto-IV distribution. Different estimates of the model parameters, such as the maximum likelihood estimates (MLEs) and Bayes estimates (BEs), are discussed. Bayesian estimates are derived using the Tierney and Kadane (TK) approximation method and the importance sampling method. The asymptotic and bootstrap confidence intervals (CIs) of the parameters are constructed. A real data set is analyzed to illustrate the methods proposed in this paper. Two types of progressive-stress tests, the simple ramp-stress test and the multiple ramp-stress test, are compared through a simulation study. Finally, some interesting conclusions are drawn.
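The Pareto-IV lifetime distribution used above admits closed-form CDF and quantile functions, which makes inverse-transform simulation of test data straightforward. The sketch below uses Arnold's four-parameter form (location mu, scale sigma, inequality gamma, shape alpha), an assumed parameterisation that may differ from the paper's:

```python
def pareto4_cdf(x, mu, sigma, gamma, alpha):
    """CDF of the Pareto-IV distribution:
    F(x) = 1 - [1 + ((x - mu)/sigma)^(1/gamma)]^(-alpha) for x > mu."""
    if x <= mu:
        return 0.0
    z = ((x - mu) / sigma) ** (1.0 / gamma)
    return 1.0 - (1.0 + z) ** (-alpha)

def pareto4_ppf(u, mu, sigma, gamma, alpha):
    """Quantile function (inverse CDF) for u in (0, 1); feeding it
    uniform random numbers yields Pareto-IV samples."""
    return mu + sigma * ((1.0 - u) ** (-1.0 / alpha) - 1.0) ** gamma
```

The special cases gamma = 1 (Pareto II / Lomax) and mu = 0 recover the simpler members of the Pareto family.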

