Benchmarking metagenomics classifiers on ancient viral DNA: a simulation study

2021 ◽  
Author(s):  
Yami Ommar Arizmendi Cárdenas ◽  
Samuel Neuenschwander ◽  
Anna-Sapfo Malaspinas

Owing to technological advances in ancient DNA, it is now possible to sequence viruses from the past to track down their origin and evolution. However, ancient DNA data are considerably more degraded and contaminated than modern data, making the identification of ancient viral genomes particularly challenging. Several methods to characterise the modern microbiome (and, within this, the virome) have been developed. Many of them assign sequenced reads to specific taxa to characterise the organisms present in a sample of interest. While these existing tools are routinely used on modern data, their performance when applied to ancient virome data remains unknown. In this work, we conduct an extensive simulation study using public viral sequences to establish which tool is the most suitable for ancient virome studies. We compare the performance of four widely used classifiers, namely Centrifuge, Kraken2, DIAMOND and MetaPhlAn2, in correctly assigning sequencing reads to the corresponding viruses. To do so, we simulate reads by adding noise typical of ancient DNA to a randomly chosen set of publicly available viral sequences and to the human genome. We fragment the DNA into different lengths, and add sequencing error and C-to-T and G-to-A deamination substitutions at the read termini. We then measure the resulting precision and sensitivity for all classifiers. Across most simulations, 119 out of the 120 simulated viruses are recovered by Centrifuge, Kraken2 and DIAMOND, in contrast to MetaPhlAn2, which recovers only around one third. While deamination damage has little impact on the performance of the classifiers, DIAMOND and Kraken2 cannot classify very short reads. For data with longer fragments, if precision is strongly favoured over sensitivity, DIAMOND performs best. However, since Centrifuge can handle short reads and since it achieves the highest sensitivity and precision at the species level, it is our recommended tool overall.
Regardless of the tool used, our simulations indicate that, for ancient human studies, users should use strict filters to remove all reads of potential human origin. Finally, if the goal is to detect a specific virus, given the high variability observed among tested viral sequences, a simulation study to determine if a given tool can recover the virus of interest should be conducted prior to analysing real data.
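The damage model described above (fragmentation, uniform sequencing error, C-to-T and G-to-A deamination concentrated at the read termini) can be sketched in a few lines. The function name, default rates and overhang length below are illustrative assumptions, not the authors' actual simulation pipeline:

```python
import random

def simulate_ancient_read(sequence, read_length, error_rate=0.001,
                          deamination_rate=0.3, overhang=5, rng=None):
    """Simulate one ancient-DNA-like read from a reference sequence:
    fragment, apply terminal deamination, then uniform sequencing error."""
    rng = rng or random.Random()
    start = rng.randrange(0, len(sequence) - read_length + 1)
    read = list(sequence[start:start + read_length])

    # Deamination damage: C->T near the 5' terminus, G->A near the 3' terminus.
    for i in range(min(overhang, read_length)):
        if read[i] == "C" and rng.random() < deamination_rate:
            read[i] = "T"
        j = read_length - 1 - i
        if read[j] == "G" and rng.random() < deamination_rate:
            read[j] = "A"

    # Uniform sequencing error on every base.
    for i, base in enumerate(read):
        if rng.random() < error_rate:
            read[i] = rng.choice([b for b in "ACGT" if b != base])
    return "".join(read)
```

A real study would additionally draw fragment lengths from an empirical distribution and let the deamination probability decay with distance from the terminus.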

2020 ◽  
Author(s):  
Chen Cao ◽  
Matthew Greenberg ◽  
Quan Long

Abstract Many tools can reconstruct viral sequences based on next generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0 + L1-regularized regression synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours. Source code and binaries are freely available at https://github.com/theLongLab/wglink.
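The abstract does not give pseudocode for WgLink's L0 + L1-regularized regression, but the L1 part of such an objective is conventionally handled with the soft-thresholding proximal operator. The sketch below shows plain ISTA (proximal gradient descent) for the lasso sub-problem only, as an illustration of the regularization idea rather than WgLink's actual algorithm:

```python
def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_ista(X, y, lam, step=0.01, iters=2000):
    """Minimise 0.5*||y - X b||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        # Residual r = X b - y, then gradient of the smooth part: X^T r.
        r = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        # Gradient step followed by the L1 proximal step.
        b = [soft_threshold(b[j] - step * g[j], step * lam) for j in range(p)]
    return b
```

An L0 term additionally caps the number of nonzero coefficients, which is typically handled by hard thresholding or combinatorial search rather than a convex proximal step.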


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Abstract Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
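For reference, the standard (non-robust) Yeo–Johnson transform that the abstract builds on has a simple closed form. This stdlib sketch implements the textbook definition for a scalar; it is not the robust modification the authors propose, and the transformation parameter would normally be fitted by maximum likelihood (e.g. scipy.stats.yeojohnson):

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a scalar x for parameter lmbda.

    lmbda = 1 leaves the data unchanged; lmbda < 1 pulls in the right
    tail; lmbda > 1 pulls in the left tail. Defined for all real x.
    """
    if x >= 0:
        if lmbda != 0:
            return ((x + 1.0) ** lmbda - 1.0) / lmbda
        return math.log(x + 1.0)
    if lmbda != 2:
        return -(((-x + 1.0) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda))
    return -math.log(-x + 1.0)
```

The paper's point is that when lmbda is estimated by ordinary maximum likelihood, a few outliers can drag the estimate away from the value that normalizes the bulk of the data.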


Author(s):  
Manish C Choudhary ◽  
Charles R Crain ◽  
Xueting Qiu ◽  
William Hanage ◽  
Jonathan Z Li

Abstract Background Both SARS-CoV-2 reinfection and persistent infection have been reported, but sequence characteristics in these scenarios have not been described. We assessed published cases of SARS-CoV-2 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection. Methods A systematic review of PubMed was conducted to identify cases of SARS-CoV-2 reinfection and persistence with available sequences. Nucleotide and amino acid changes in the reinfecting sequence were compared to both the initial and contemporaneous community variants. Time-measured phylogenetic reconstruction was performed to compare intra-host viral evolution in persistent SARS-CoV-2 to community-driven evolution. Results Twenty reinfection and nine persistent infection cases were identified. Reports of reinfection cases spanned a broad distribution of ages, baseline health status and reinfection severity, with reinfection occurring as early as 1.5 months or as late as more than 8 months after the initial infection. The reinfecting viral sequences had a median of 17.5 nucleotide changes, with enrichment in the ORF8 and N genes. The number of changes did not differ by the severity of reinfection, and reinfecting variants were similar to the contemporaneous sequences circulating in the community. Patients with persistent COVID-19 demonstrated more rapid accumulation of sequence changes than seen with community-driven evolution, with continued evolution during convalescent plasma or monoclonal antibody treatment. Conclusions Reinfecting SARS-CoV-2 viral genomes largely mirror contemporaneous circulating sequences in the same geographic region, while persistent COVID-19 has been largely described in immunosuppressed individuals and is associated with accelerated viral evolution.


1980 ◽  
Vol 210 (1180) ◽  
pp. 423-435 ◽  

We have cloned and propagated in prokaryotic vectors the viral DNA sequences that are integrated in a variety of cells transformed by adenovirus 2 or SV40. Analysis of the clones reveals that the viral DNA sequences sometimes are arranged in a simple fashion, collinear with the viral genome; in other cell lines there are complex arrangements of viral sequences in which tracts of the viral genome are inverted with respect to each other. In several cases the nucleotide sequences at the joints between cell and viral sequences have been determined: usually there is a sharp transition between cellular and viral DNAs. The viral sequences are integrated at different locations within the genomes of different cell lines; likewise there is no specific site on the viral genomes at which integration occurs. Sometimes the viral sequences are integrated within repetitive cellular DNA, and sometimes within unique sequences. In some cases there is evidence that the viral sequences along with the flanking cell DNA have been amplified after integration. The sequences that flank the viral insertion in the line of SV40-transformed rat cells known as 14B have been used as probes to isolate, from untransformed rat cells, clones that carry the region of the chromosome in which integration occurred. Analysis of the structure of these clones by restriction endonuclease digestion and heteroduplex formation shows that a rearrangement of cellular sequences has occurred, presumably as a consequence of integration.


2006 ◽  
Vol 87 (10) ◽  
pp. 3045-3051 ◽  
Author(s):  
Mazen S. Habayeb ◽  
Sophia K. Ekengren ◽  
Dan Hultmark

Several viruses, including picornaviruses, are known to establish persistent infections, but the mechanisms involved are poorly understood. Here, a novel picorna-like virus, Nora virus, which causes a persistent infection in Drosophila melanogaster, is described. It has a single-stranded, positive-sense genomic RNA of 11879 nt, followed by a poly(A) tail. Unlike other picorna-like viruses, the genome has four open reading frames (ORFs). One ORF encodes a picornavirus-like cassette of proteins for virus replication, including an iflavirus-like RNA-dependent RNA polymerase and a helicase that is related to those of mammalian picornaviruses. The three other ORFs are not closely related to any previously described viral sequences. The unusual sequence and genome organization in Nora virus suggest that it belongs to a new family of picorna-like viruses. Surprisingly, Nora virus could be detected in all tested D. melanogaster laboratory stocks, as well as in wild-caught material. The viral titres varied enormously, between 10^4 and 10^10 viral genomes per fly in different stocks, without causing obvious pathological effects. The virus was also found in Drosophila simulans, a close relative of D. melanogaster, but not in more distantly related Drosophila species. It will now be possible to use Drosophila genetics to study the factors that control this persistent infection.


2021 ◽  
Vol 9 (1) ◽  
pp. 190-210
Author(s):  
Arvid Sjölander ◽  
Ola Hössjer

Abstract Unmeasured confounding is an important threat to the validity of observational studies. A common way to deal with unmeasured confounding is to compute bounds for the causal effect of interest, that is, a range of values that is guaranteed to include the true effect, given the observed data. Recently, bounds have been proposed that are based on sensitivity parameters, which quantify the degree of unmeasured confounding on the risk ratio scale. These bounds can be used to compute an E-value, that is, the degree of confounding required to explain away an observed association, on the risk ratio scale. We complement and extend this previous work by deriving analogous bounds, based on sensitivity parameters on the risk difference scale. We show that our bounds can also be used to compute an E-value, on the risk difference scale. We compare our novel bounds with previous bounds through a real data example and a simulation study.
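On the risk ratio scale, the E-value mentioned above has a well-known closed form, E = RR + sqrt(RR * (RR - 1)). A minimal sketch of that standard formula follows; the paper's own contribution, analogous bounds and E-values on the risk difference scale, is not reproduced here:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (VanderWeele & Ding formula):
    the minimum strength of confounding, on the risk ratio scale, that
    both confounder-exposure and confounder-outcome associations must
    have to fully explain away the observed association."""
    if rr < 1:
        rr = 1.0 / rr  # protective associations are inverted first
    return rr + math.sqrt(rr * (rr - 1.0))
```

For example, an observed risk ratio of 2 yields an E-value of about 3.41, meaning unmeasured confounding would need associations of that magnitude with both exposure and outcome to reduce the observed association to the null.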


2017 ◽  
Vol 40 (2) ◽  
pp. 205-221 ◽  
Author(s):  
Shahryar Mirzaei ◽  
Gholam Reza Mohtashami Borzadaran ◽  
Mohammad Amini

In this paper, we consider two well-known methods for analysis of the Gini index, namely U-statistics and linearization, for some income distributions. In addition, we evaluate the two methods with respect to some properties of their proposed estimators. We also compare the two methods with resampling techniques in approximating some properties of the Gini index. A simulation study shows that the linearization method performs well compared to the Gini estimator based on U-statistics. A brief study on real data supports our findings.
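The U-statistic form of the Gini estimator referred to above can be written directly as a normalized sum of pairwise absolute differences. A minimal stdlib sketch (the linearization estimator and the paper's variance comparisons are not reproduced):

```python
from itertools import combinations

def gini_u_statistic(x):
    """Gini index estimated via the U-statistic form: the mean absolute
    pairwise difference (the unbiased kernel average over all pairs),
    divided by twice the sample mean."""
    n = len(x)
    mean = sum(x) / n
    pairwise = sum(abs(a - b) for a, b in combinations(x, 2))
    # Equivalent to (2/(n(n-1)) * pairwise) / (2 * mean).
    return pairwise / (n * (n - 1) * mean)
```

The O(n^2) pairwise sum is the honest U-statistic; for large samples the same quantity is usually computed in O(n log n) after sorting.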


2021 ◽  
Vol 28 (1) ◽  
Author(s):  
Diego Michael Cornelius dos Santos ◽  
Bruna Karine dos Santos ◽  
César Gabriel dos Santos

Abstract: Due to technological advances, trade policies and society's consumption patterns, competitiveness among companies has increased considerably, requiring practices that provide constant improvement in production indicators and product quality. In this context, the use of Toyota Production System tools, also known as Lean Manufacturing, has a fundamental role in the elimination of waste and the continuous improvement of industrial production levels. Thus, this work aims to implement a standardized work routine among employees working in the parts market of an agricultural machinery industry, which lacked defined production methods. To represent this situation, real data corresponding to the needs of the assembly line were used, and served as the basis for the analysis and implementation of a new work routine. The results obtained enabled the creation of a standardized work routine, achieved by balancing activities between operators and eliminating activities that did not add value to the product.


2019 ◽  
Author(s):  
Leili Tapak ◽  
Omid Hamidi ◽  
Majid Sadeghifar ◽  
Hassan Doosti ◽  
Ghobad Moradi

Abstract Objectives Zero-inflated proportion or rate data nested in clusters due to the sampling structure can be found in many disciplines. Sometimes the rate response may not be observed for some study units because of limitations (false negatives), such as failures in recording data, and zeros are observed instead of the actual value of the rate/proportion (low incidence). In this study, we propose a multilevel zero-inflated censored Beta regression model that can address zero-inflated rate data with low incidence. Methods We assumed that the random effects are independent and normally distributed. The performance of the proposed approach was evaluated by application to a three-level real data set and by a simulation study. We applied the proposed model to analyze brucellosis diagnosis rate data and to investigate the effects of climate and geographical position. For comparison, we also applied the standard zero-inflated censored Beta regression model, which does not account for correlation. Results The proposed model performed better than the standard zero-inflated censored Beta model based on the AIC criterion. Height (p-value <0.0001), temperature (p-value <0.0001) and precipitation (p-value = 0.0006) significantly affected brucellosis rates, whereas precipitation in the standard ZICBETA model was not statistically significant (p-value = 0.385). The simulation study also showed that the estimates obtained by the maximum likelihood approach were reasonable in terms of mean square error. Conclusions The results showed that the proposed method can capture the correlations in the real data set and yields accurate parameter estimates.
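The zero-inflated Beta likelihood underlying such models has a simple mixture form: a point mass pi0 at zero and a Beta density on (0, 1) otherwise. A stdlib sketch using the common mean–precision parameterisation of the Beta part (an assumption; the paper's multilevel random effects and censoring components are omitted):

```python
import math

def beta_pdf(y, a, b):
    """Density of Beta(a, b) at y in (0, 1), via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y)
                    + (b - 1.0) * math.log(1.0 - y))

def zi_beta_density(y, pi0, mu, phi):
    """Zero-inflated Beta: point mass pi0 at zero, otherwise
    (1 - pi0) times a Beta(mu*phi, (1-mu)*phi) density, where mu is
    the conditional mean and phi the precision."""
    if y == 0:
        return pi0
    return (1.0 - pi0) * beta_pdf(y, mu * phi, (1.0 - mu) * phi)
```

In a regression setting, mu and pi0 would each be linked to covariates (e.g. logit links), and the multilevel version adds cluster-specific normal random effects to those linear predictors.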


Mathematics ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. 1786 ◽  
Author(s):  
A. M. Abd El-Raheem ◽  
M. H. Abu-Moussa ◽  
Marwa M. Mohie El-Din ◽  
E. H. Hafez

In this article, a progressive-stress accelerated life test (ALT) based on progressive type-II censoring is studied. The cumulative exposure model is used when the lifetime of test units follows the Pareto-IV distribution. Different estimates of the model parameters, such as the maximum likelihood estimates (MLEs) and Bayes estimates (BEs), are discussed. Bayesian estimates are derived using the Tierney and Kadane (TK) approximation method and the importance sampling method. The asymptotic and bootstrap confidence intervals (CIs) of the parameters are constructed. A real data set is analyzed to illustrate the methods proposed in this paper. Two types of progressive-stress tests, the simple ramp-stress test and the multiple ramp-stress test, are compared through a simulation study. Finally, some interesting conclusions are drawn.
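The Pareto-IV lifetime distribution used above admits closed-form CDF and quantile functions, which makes inverse-transform simulation of test data straightforward. The sketch below uses Arnold's four-parameter form (location mu, scale sigma, inequality gamma, shape alpha), an assumed parameterisation that may differ from the paper's:

```python
def pareto4_cdf(x, mu, sigma, gamma, alpha):
    """CDF of the Pareto-IV distribution:
    F(x) = 1 - [1 + ((x - mu)/sigma)^(1/gamma)]^(-alpha) for x > mu."""
    if x <= mu:
        return 0.0
    z = ((x - mu) / sigma) ** (1.0 / gamma)
    return 1.0 - (1.0 + z) ** (-alpha)

def pareto4_ppf(u, mu, sigma, gamma, alpha):
    """Quantile function (inverse CDF) for u in (0, 1); feeding it
    uniform random numbers yields Pareto-IV samples."""
    return mu + sigma * ((1.0 - u) ** (-1.0 / alpha) - 1.0) ** gamma
```

The special cases gamma = 1 (Pareto II / Lomax) and mu = 0 recover the simpler members of the Pareto family.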

