Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

2021 ◽  
Vol 12 ◽  
Author(s):  
Valérian Lupo ◽  
Mick Van Vlierberghe ◽  
Hervé Vanderschuren ◽  
Frédéric Kerff ◽  
Denis Baurain ◽  
...  

The presence of contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.

2018 ◽  
Author(s):  
Luc Cornet ◽  
Loïc Meunier ◽  
Mick Van Vlierberghe ◽  
Raphaël R. Léonard ◽  
Benoit Durieu ◽  
...  

Background: Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. Results: As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that >5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Conclusions: Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make mandatory the submission of raw read data along with genome assemblies in order to facilitate the detection of contaminants in sequence databases.
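The consensus criterion in this abstract (an assembly counts as highly contaminated when 5 or 6 of the six methods detect contaminants) can be illustrated with a minimal sketch. This is not the authors' pipeline: the method keys, the assembly identifiers, and the `min_methods` parameter are hypothetical, and the abstract speaks of a consensus of rankings, which this simple detection count only approximates.

```python
from collections import Counter


def consensus_counts(flags_by_method):
    """Count, for each assembly, how many methods flagged it as contaminated.

    flags_by_method maps a method name to the collection of assembly IDs
    that this method reported as contaminated (hypothetical input format).
    """
    counts = Counter()
    for flagged in flags_by_method.values():
        # set() guards against an assembly being listed twice by one method
        counts.update(set(flagged))
    return counts


def highly_contaminated(flags_by_method, min_methods=5):
    """Return the sorted assembly IDs flagged by at least min_methods methods."""
    counts = consensus_counts(flags_by_method)
    return sorted(asm for asm, n in counts.items() if n >= min_methods)


# Toy input: identifiers and flags are made up for illustration only.
toy_flags = {
    "ssu_rrna": ["asm_A", "asm_B"],
    "ribosomal_proteins": ["asm_A", "asm_B", "asm_C"],
    "checkm": ["asm_A", "asm_B"],
    "kraken": ["asm_A", "asm_B", "asm_C"],
    "diamond_blastx": ["asm_A", "asm_B"],
    "concoct": ["asm_A"],
}

print(highly_contaminated(toy_flags))
```

In this toy input, asm_A is flagged by all six methods and asm_B by five, so both pass the threshold, while asm_C (two methods) does not.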


Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial Cytochrome-c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.


2018 ◽  
Vol 29 (14) ◽  
pp. 1682-1692 ◽  
Author(s):  
Marc A. Vittoria ◽  
Elizabeth M. Shenk ◽  
Kevin P. O’Rourke ◽  
Amanda F. Bolgioni ◽  
Sanghee Lim ◽  
...  

Tetraploid cells, which are most commonly generated by errors in cell division, are genomically unstable and have been shown to promote tumorigenesis. Recent genomic studies have estimated that ∼40% of all solid tumors have undergone a genome-doubling event during their evolution, suggesting a significant role for tetraploidy in driving the development of human cancers. To safeguard against the deleterious effects of tetraploidy, nontransformed cells that fail mitosis and become tetraploid activate both the Hippo and p53 tumor suppressor pathways to restrain further proliferation. Tetraploid cells must therefore overcome these antiproliferative barriers to ultimately drive tumor development. However, the genetic routes through which spontaneously arising tetraploid cells adapt to regain proliferative capacity remain poorly characterized. Here, we conducted a comprehensive gain-of-function genome-wide screen to identify microRNAs (miRNAs) that are sufficient to promote the proliferation of tetraploid cells. Our screen identified 23 miRNAs whose overexpression significantly promotes tetraploid proliferation. The vast majority of these miRNAs facilitate tetraploid growth by enhancing mitogenic signaling pathways (e.g., miR-191-3p); however, we also identified several miRNAs that impair the p53/p21 pathway (e.g., miR-523-3p), and a single miRNA (miR-24-3p) that potently inactivates the Hippo pathway via down-regulation of the tumor suppressor gene NF2. Collectively, our data reveal several avenues through which tetraploid cells may regain the proliferative capacity necessary to drive tumorigenesis.


Author(s):  
Nida Tabassum Khan ◽  
Namra Jameel ◽  
Maham Jamil Khan

Functional genomics manipulates genomic data to study genes and their expression on a genome-wide scale using high-throughput methods. The key objective of functional genomics is to exploit the data acquired from transcriptomic and genomic studies to explain the functions and interfaces of a genome and its corresponding phenotype.


2015 ◽  
Author(s):  
Peter Menzel ◽  
Kim Lee Ng ◽  
Anders Krogh

The constantly decreasing cost and increasing output of current sequencing technologies enable large-scale metagenomic studies of microbial communities from diverse habitats. Therefore, fast and accurate methods for taxonomic classification are needed, which can operate on increasingly larger datasets and reference databases. Recently, several fast metagenomic classifiers have been developed, which are based on comparison of genomic k-mers. However, nucleotide comparison using a fixed k-mer length often lacks the sensitivity to overcome the evolutionary distance between sampled species and genomes in the reference database. Here, we present the novel metagenome classifier Kaiju for fast assignment of reads to taxa. Kaiju finds maximum exact matches on the protein level using the Burrows-Wheeler transform, and can optionally allow amino acid substitutions in the search using a greedy heuristic. We show in a genome exclusion study that Kaiju can classify more reads with higher sensitivity and similar precision compared to fast k-mer based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies more than twice as many reads in ten real metagenomes compared to programs based on genomic k-mers. Kaiju can process up to millions of reads per minute, and its memory footprint is below 6 GB of RAM, allowing the analysis on a standard PC. The program is available under the GPL3 license at: http://bioinformatics-centre.github.io/kaiju
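To make the role of the Burrows-Wheeler transform in exact matching more concrete, here is a toy sketch of the transform and of backward search over it, applied to a short made-up amino acid string. It is not Kaiju's implementation: the function names and the example sequence are hypothetical, the transform is built naively from sorted rotations, and occurrence counts are recomputed on each step rather than stored in an index.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    The sentinel must sort before every character of the text
    ("$" sorts before uppercase letters in ASCII).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # start[c]: number of characters in the text that sort before c,
    # i.e. where the block of c begins in the sorted first column.
    start = {}
    total = 0
    for c in sorted(set(bwt_string)):
        start[c] = total
        total += bwt_string.count(c)

    def occ(c, i):
        # Occurrences of c in bwt_string[:i] (recomputed; a real index caches this).
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # half-open range of matching sorted rotations
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in start:
            return 0
        lo = start[c] + occ(c, lo)
        hi = start[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Made-up protein fragment, for illustration only.
reference = "MKVLAAGMKV"
index = bwt(reference)
print(count_occurrences(index, "MKV"))
```

Backward search processes the query from its last character to its first, so each step either narrows the range of matching rotations or shows that no exact match exists.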


2021 ◽  
Vol 17 (11) ◽  
pp. e1009581
Author(s):  
Michael S. Robeson ◽  
Devon R. O’Rourke ◽  
Benjamin D. Kaehler ◽  
Michal Ziemski ◽  
Matthew R. Dillon ◽  
...  

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


Author(s):  
Michael S. Robeson ◽  
Devon R. O’Rourke ◽  
Benjamin D. Kaehler ◽  
Michal Ziemski ◽  
Matthew R. Dillon ◽  
...  

Background: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a software package for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. Results: To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA, and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. Conclusions: RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


2016 ◽  
Author(s):  
Glen Otero ◽  
Benjamin M. Althouse ◽  
Samuel V. Scarpino

Background: Despite high levels of vaccination, whooping cough, primarily caused by Bordetella pertussis (BP), has persisted and resurged. It remains a major cause of infant death worldwide and is the most prevalent vaccine-preventable disease in developed countries. To date, most genomic studies have focused on a small subset of the BP genome, biasing our clinical understanding and public health awareness. Methods: We performed a Genome-Wide Association Study (GWAS) on 76 U.S. BP whole genomes, including strains from recent outbreaks. Results: A GWAS of the 76 BP isolates revealed a sharp increase in genetic variation associated with the Minnesota 2012 outbreak and identified 52 variants unique to the Minnesota outbreak and 19 unique to the California and Washington outbreaks. None of the identified variants were shared between the outbreaks and the vast majority were previously uncharacterized. We further identified variation associated with pertactin-negative strains and acellular vaccination. Conclusions: We identified novel genomic regions associated with recent BP outbreaks. Our results underscore the need for increased whole genome sequencing of BP isolates, which can reduce costly misdiagnosis and improve surveillance. The genes containing these variants warrant further investigation into their possible roles in BP pathogenicity and the ongoing resurgence in the U.S.


2018 ◽  
Author(s):  
Marc A. Vittoria ◽  
Elizabeth M. Shenk ◽  
Kevin P. O’Rourke ◽  
Amanda F. Bolgioni ◽  
Sanghee Lim ◽  
...  

Tetraploid cells, which are most commonly generated by errors in cell division, are genomically unstable and have been shown to promote tumorigenesis. Recent genomic studies have estimated that ∼40% of all solid tumors have undergone a genome-doubling event during their evolution, suggesting a significant role for tetraploidy in driving the development of human cancers. To safeguard against the deleterious effects of tetraploidy, non-transformed cells that fail mitosis and become tetraploid activate both the Hippo and p53 tumor suppressor pathways to restrain further proliferation. Tetraploid cells must therefore overcome these anti-proliferative barriers to ultimately drive tumor development. However, the genetic routes through which spontaneously arising tetraploid cells adapt to regain proliferative capacity remain poorly characterized. Here, we conducted a comprehensive, gain-of-function genome-wide screen to identify miRNAs that are sufficient to promote the proliferation of tetraploid cells. Our screen identified 23 miRNAs whose overexpression significantly promotes tetraploid proliferation. The vast majority of these miRNAs facilitate tetraploid growth by enhancing mitogenic signaling pathways (e.g. miR-191-3p); however, we also identified several miRNAs that impair the p53/p21 pathway (e.g. miR-523-3p), and a single miRNA (miR-24-3p) that potently inactivates the Hippo pathway via downregulation of the tumor suppressor gene NF2. Collectively, our data reveal several avenues through which tetraploid cells may regain the proliferative capacity necessary to drive tumorigenesis.


2019 ◽  
Author(s):  
Song Li ◽  
ZiHui Liu ◽  
Linlin Guo ◽  
Hongjie Li ◽  
Xiaojun Nie ◽  
...  

Background: The plant ZIP (Zn-regulated, iron-regulated transporter-like protein) transporter family is one of the most essential gene families regulating the uptake, transport and accumulation of microelements, which play important roles in plant growth, development and biofortification. Although the ZIP family has been systematically studied in many plant species, the significance of this family in wheat is not well understood at present. Results: Through a genome-wide search based on the latest wheat reference sequence (IWGSC_V1.1), 58 TaZIP genes were identified. Most of these genes were represented by two to three homoalleles, which were named TaZIP_-A, TaZIP_-B, and TaZIP_-D. Protein structure analysis revealed that most TaZIP proteins contain more than six transmembrane (TM) domains and that the distance between TM-3 and TM-4 is variable. Furthermore, the TaZIP proteins clustered into four groups in a phylogenetic tree, and the proteins belonging to the same group shared similar exon-intron structures and conserved motifs. Expression pattern analysis revealed that most TaZIP genes were significantly highly expressed in roots, and that nine TaZIP genes were up-regulated at the grain filling stage. When exposed to ZnSO4 and FeCl3 solutions, TaZIP genes showed different expression patterns, and 16 TaZIP genes were identified as candidate high-affinity Zn transporter genes and 23 as low-affinity Zn transporter genes. Finally, using yeast complementation analysis, three TaZIP genes were demonstrated to have the capacity to transport Zn and Fe. Conclusion: This study systematically analyzed the genomic organization, gene structures and expression profiles of TaZIPs. The findings not only provide candidates for further functional analysis, but also contribute to a better understanding of the regulatory roles of ZIPs in wheat.

