Phylogeny-aware Identification and Correction of Taxonomically Mislabeled Sequences

2016 ◽  
Author(s):  
Alexey M. Kozlov ◽  
Jiajie Zhang ◽  
Pelin Yilmaz ◽  
Frank Oliver Glöckner ◽  
Alexandros Stamatakis

Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labour-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (“mislabels”) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity / 91.7% precision) as well as correction (94.9% sensitivity / 89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
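The sensitivity and precision figures quoted in this abstract follow the usual definitions based on counts of true positives, false positives and false negatives. A minimal sketch of those two formulas, assuming hypothetical counts (the numbers and function names below are illustrative only and are not taken from the paper or from SATIVA's output):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real mislabels that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of flagged sequences that are real mislabels: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only (not from the paper).
tp, fp, fn = 90, 10, 5
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"precision   = {precision(tp, fp):.3f}")
```

Reporting both values matters here because a method could reach high sensitivity simply by flagging many sequences, which would show up as low precision.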

2016 ◽  
Vol 1 ◽  
pp. 4 ◽  
Author(s):  
Sarah Auburn ◽  
Ulrike Böhme ◽  
Sascha Steinbiss ◽  
Hidayat Trimarsanto ◽  
Jessica Hostetler ◽  
...  

Plasmodium vivax is now the predominant cause of malaria in the Asia-Pacific, South America and Horn of Africa. Laboratory studies of this species are constrained by the inability to maintain the parasite in continuous ex vivo culture, but genomic approaches provide an alternative and complementary avenue to investigate the parasite’s biology and epidemiology. To date, molecular studies of P. vivax have relied on the Salvador-I reference genome sequence, derived from a monkey-adapted strain from South America. However, the Salvador-I reference remains highly fragmented with over 2500 unassembled scaffolds. Using high-depth Illumina sequence data, we assembled and annotated a new reference sequence, PvP01, sourced directly from a patient from Papua, Indonesia. Draft assemblies of isolates from China (PvC01) and Thailand (PvT01) were also prepared for comparative purposes. The quality of the PvP01 assembly is improved greatly over Salvador-I, with fragmentation reduced to 226 scaffolds. Detailed manual curation has ensured highly comprehensive annotation, with functions attributed to 58% of core genes in PvP01 versus 38% in Salvador-I. The assemblies of PvP01, PvC01 and PvT01 are larger than that of Salvador-I (28-30 versus 27 Mb), owing to improved assembly of the subtelomeres. An extensive repertoire of over 1200 Plasmodium interspersed repeat (pir) genes was identified in PvP01 compared to 346 in Salvador-I, suggesting a vital role in parasite survival or development. The manually curated PvP01 reference and PvC01 and PvT01 draft assemblies are important new resources to study vivax malaria. PvP01 is maintained at GeneDB and ongoing curation will ensure continual improvements in assembly and annotation quality.


2017 ◽  
Vol 4 (8) ◽  
pp. 170315 ◽  
Author(s):  
Oleksandr Holovachov ◽  
Quiterie Haenel ◽  
Sarah J. Bourlat ◽  
Ulf Jondelius

Precision and reliability of barcode-based biodiversity assessment can be affected at several steps during acquisition and analysis of data. Identification of operational taxonomic units (OTUs) is one of the crucial steps in the process and can be accomplished using several different approaches, namely, alignment-based, probabilistic, tree-based and phylogeny-based. The number of identified sequences in the reference databases affects the precision of identification. This paper compares the identification of marine nematode OTUs using alignment-based, tree-based and phylogeny-based approaches. Because the nematode reference dataset is limited in its taxonomic scope, OTUs can only be assigned to higher taxonomic categories, families. The phylogeny-based approach using the evolutionary placement algorithm provided the largest number of positively assigned OTUs and was least affected by erroneous sequences and limitations of reference data, compared to alignment-based and tree-based approaches.


Widyaparwa ◽  
2017 ◽  
Vol 45 (2) ◽  
pp. 151-164
Author(s):  
Novita Sumarlin Putri

Commissive speech acts are one of the pragmatic aspects that translators must attend to when translating a text, so that the translation is of high quality in terms of accuracy and acceptability. For this reason, this study aims to describe the accuracy and acceptability of translated sentences that accommodate commissive speech acts, using a pragmatic approach. The data are commissive utterances and translation quality assessments, drawn from the novel Insurgent by Veronica Roth and from informants. The data were collected through document analysis, questionnaires, and Focus Group Discussion, and then analyzed using domain, taxonomic, componential, and cultural theme analysis. The results show that the translation of the novel Insurgent has fairly high accuracy and acceptability. The study concludes that the level of accuracy and acceptability for each type of commissive speech act affects the overall quality of translated sentences containing commissive speech acts.


2021 ◽  
Vol 18 (2) ◽  
pp. 156-164 ◽  
Author(s):  
Catherine L. Lawson ◽  
Andriy Kryshtafovych ◽  
Paul D. Adams ◽  
Pavel V. Afonine ◽  
Matthew L. Baker ◽  
...  

This paper describes outcomes of the 2019 Cryo-EM Model Challenge. The goals were to (1) assess the quality of models that can be produced from cryogenic electron microscopy (cryo-EM) maps using current modeling software, (2) evaluate reproducibility of modeling results from different software developers and users and (3) compare performance of current metrics used for model evaluation, particularly Fit-to-Map metrics, with focus on near-atomic resolution. Our findings demonstrate the relatively high accuracy and reproducibility of cryo-EM models derived by 13 participating teams from four benchmark maps, including three forming a resolution series (1.8 to 3.1 Å). The results permit specific recommendations to be made about validating near-atomic cryo-EM structures both in the context of individual experiments and structure data archives such as the Protein Data Bank. We recommend the adoption of multiple scoring parameters to provide full and objective annotation and assessment of the model, reflective of the observed cryo-EM map density.


2012 ◽  
Vol 29 (6) ◽  
pp. 772-795 ◽  
Author(s):  
Lei Lei ◽  
Guifu Zhang ◽  
Richard J. Doviak ◽  
Robert Palmer ◽  
Boon Leng Cheong ◽  
...  

The quality of polarimetric radar data degrades as the signal-to-noise ratio (SNR) decreases. This substantially limits the usage of collected polarimetric radar data to high SNR regions. To improve data quality at low SNRs, multilag correlation estimators are introduced. The performance of the multilag estimators for spectral moments and polarimetric parameters is examined through a theoretical analysis and by the use of simulated data. The biases and standard deviations of the estimates are calculated and compared with those estimates obtained using the conventional method.


Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial Cytochrome-c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.


2019 ◽  
Vol 47 (6) ◽  
pp. E9 ◽  
Author(s):  
Geirmund Unsgård ◽  
Frank Lindseth

3D ultrasound (US) is a convenient tool for guiding the resection of low-grade gliomas, seemingly without deterioration in patients’ quality of life. This article offers an update of the intraoperative workflow and the general principles behind the 3D US acquisition of high-quality images. The authors also provide case examples illustrating the technique in two small mesial temporal lobe lesions and in one insular glioma. Due to the ease of acquiring new images for navigation, the operations can be guided by updated image volumes throughout the entire course of surgery. The high accuracy offered by 3D US systems, based on nearly real-time images, allows for precise and safe resections. This is especially useful when an operation is performed through very narrow transcortical corridors.


2021 ◽  
Author(s):  
Carole Belliardo ◽  
Georgios Koutsovoulos ◽  
Corinne Rancurel ◽  
Mathilde Clement ◽  
Justine Lipuma ◽  
...  

Background | During the last decades, shotgun metagenomics and metabarcoding have highlighted the diversity of microorganisms from environmental or host-associated samples. Most assembled metagenome public repositories use annotation pipelines tailored for prokaryotes regardless of the taxonomic origin of contigs and metagenome-assembled genomes (MAGs). Consequently, eukaryotic contigs and MAGs, with intrinsically different gene features, are not optimally annotated, resulting in an incorrect representation of the eukaryotic component of biodiversity, despite their biological relevance. Results | Using an automated analysis pipeline, we have filtered eukaryotic contigs from 6,873 soil metagenomes from the IMG/M database of the Joint Genome Institute. We have re-annotated genes using eukaryote-tailored methods, yielding 5.6 million eukaryotic proteins. Our pipeline improves eukaryotic protein completeness, contiguity and quality. Moreover, the better quality of eukaryotic proteins combined with a more comprehensive assignment method improves the taxonomic annotation as well. Conclusions | Using public soil metagenomic data, we provide a dataset of eukaryotic soil proteins with improved completeness and quality as well as a more reliable taxonomic annotation. This unique resource is of interest to any scientist aiming to study the composition, biological functions and gene flux in soil communities involving eukaryotes.


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Francesc López-Giráldez ◽  
Andrew H. Moeller ◽  
Jeffrey P. Townsend

Phylogenetic research is often stymied by selection of a marker that leads to poor phylogenetic resolution despite considerable cost and effort. Profiles of phylogenetic informativeness provide a quantitative measure for prioritizing gene sampling to resolve branching order in a particular epoch. To evaluate the utility of these profiles, we analyzed phylogenomic data sets from metazoans, fungi, and mammals, thus encompassing diverse time scales and taxonomic groups. We also evaluated the utility of profiles created based on simulated data sets. We found that genes selected via their informativeness dramatically outperformed haphazard sampling of markers. Furthermore, our analyses demonstrate that the original phylogenetic informativeness method can be extended to trees with more than four taxa. Thus, although the method currently predicts phylogenetic signal without specifically accounting for the misleading effects of stochastic noise, it is robust to the effects of homoplasy. The phylogenetic informativeness rankings obtained will allow other researchers to select advantageous genes for future studies within these clades, maximizing return on effort and investment. Genes identified might also yield efficient experimental designs for phylogenetic inference for many sister clades and outgroup taxa that are closely related to the diverse groups of organisms analyzed.


2019 ◽  
Vol 2019 ◽  
pp. 1-22 ◽  
Author(s):  
Sorana D. Bolboacă

Diagnostic tests are approaches used in clinical practice to identify with high accuracy the disease of a particular patient and thus to provide early and proper treatment. Reporting high-quality results of diagnostic tests, for both basic and advanced methods, is solely the responsibility of the authors. Despite the existence of recommendations and standards regarding the content or format of statistical aspects, the quality of what is reported, and how, when a diagnostic test is assessed varies from excellent to very poor. This article briefly reviews the steps in the evaluation of a diagnostic test, from its anatomy, to its role in clinical practice, to the statistical methods used to show its performance. The statistical approaches are linked with the phase, clinical question, and objective and are accompanied by examples. More details are provided for phase I and II studies, while the statistical treatment of phase III and IV is only briefly presented. Several free online resources useful in the calculation of some statistics are also given.
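As an illustration of the kind of statistics commonly reported when a diagnostic test is assessed, the sketch below computes specificity, predictive values and likelihood ratios from a 2x2 table. It is a minimal example under stated assumptions: the class name, the choice of metrics and the counts are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from a 2x2 table of test result versus true disease status."""
    tp: int  # diseased, test positive
    fp: int  # healthy, test positive
    fn: int  # diseased, test negative
    tn: int  # healthy, test negative

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        """Positive predictive value: P(disease | positive test)."""
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        """Negative predictive value: P(no disease | negative test)."""
        return self.tn / (self.tn + self.fn)

    def lr_positive(self) -> float:
        """Positive likelihood ratio: sensitivity / (1 - specificity)."""
        return self.sensitivity() / (1 - self.specificity())

    def lr_negative(self) -> float:
        """Negative likelihood ratio: (1 - sensitivity) / specificity."""
        return (1 - self.sensitivity()) / self.specificity()


# Hypothetical counts for illustration only.
table = TwoByTwo(tp=80, fp=30, fn=20, tn=170)
for name in ("sensitivity", "specificity", "ppv", "npv",
             "lr_positive", "lr_negative"):
    print(f"{name:12s} = {getattr(table, name)():.3f}")
```

Unlike sensitivity and specificity, the predictive values depend on the disease prevalence in the sample, so they should be interpreted alongside the study design.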

