INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes

PHAGE ◽  
2021 ◽  
Author(s):  
Ryan Cook ◽  
Nathan Brown ◽  
Tamsin Redgwell ◽  
Branko Rihtman ◽  
Megan Barnes ◽  
...  

Background: With advances in sequencing technology and decreasing costs, the number of bacteriophage genomes that have been sequenced has increased markedly in the last decade. Materials and Methods: We developed an automated retrieval and analysis system for bacteriophage genomes, INPHARED (https://github.com/RyanCook94/inphared), that provides data in a consistent format. Results: As of January 2021, 14,244 complete phage genomes have been sequenced. The data set is dominated by phages that infect a small number of bacterial genera, with 75% of phages isolated on only 30 genera. There is further bias, with significantly more lytic than temperate phage genomes within the database, and ~54% of temperate phage genomes originate from just three host genera. Within phage genomes, putative antibiotic resistance genes were found at higher frequencies in temperate phages than in lytic phages. Conclusion: We provide a mechanism to reproducibly extract complete phage genomes and highlight some of the biases within these data, which underpin our current understanding of phage genomes.
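A minimal sketch of the kind of host-bias tally the abstract describes, working from a tab-separated genome summary table. The file name and the "Accession"/"Host" column names are assumptions for illustration, not INPHARED's actual output format.

```python
# Illustrative only: tally host genera in a phage genome summary table.
# Assumes a tab-separated file with "Accession" and "Host" columns; the
# real INPHARED output columns may differ.
import csv
from collections import Counter

def host_genus_counts(tsv_path):
    counts = Counter()
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            host = (row.get("Host") or "Unspecified").strip()
            counts[host.split()[0]] += 1   # genus = first word of the host name
    return counts

if __name__ == "__main__":
    counts = host_genus_counts("phage_genomes.tsv")   # hypothetical file name
    total = sum(counts.values())
    top30 = sum(n for _, n in counts.most_common(30))
    print(f"{top30 / total:.0%} of genomes fall in the 30 most common host genera")
```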


2010 ◽  
Vol 16 (2) ◽  
pp. 254-265 ◽  
Author(s):  
Žilvinas Stankevičius ◽  
Giedrė Beconytė ◽  
Aušra Kalantaitė

A unified geo-reference data model is a very important part of national geographic information management. It was developed within the Lithuanian geographic information infrastructure project in 2006–2008. The model allows automated integration of large-scale (mainly municipal) geo-reference data into the unified national geo-reference database. It is based on unique object identifiers shared across all geo-reference databases and on standard update and harmonisation procedures. The common stages of harmonising geo-reference databases at different scales include: implementation of a unique identifier for geographic objects across all databases concerned; definition of the life cycle of the objects; definition of the cohesion boundary and of the harmonisation points along that boundary; and maintenance of the local database with automatic update of the national database through a dedicated service. When implemented, such a model will significantly facilitate maintenance of the national geo-reference database and, within five years of full implementation, will have a significant economic effect. Summary: An analysis of spatial data collected by Lithuanian municipalities showed that only the larger city municipalities collect spatial data, and their data structures differ. Spatial databases created at the national level are not harmonised with one another, and the data collection process is duplicated because it is oriented towards map production at different scales. The common geo-reference data model (VGDM) covers the conversion of geo-reference data from official geographic data sets of various scales, in particular from municipal geo-reference data sets, into the unified national geo-reference database (VGDB), together with procedures for continuous updating of the VGDB. The VGDB update technology is based on the life cycle of geo-objects (vector geographic data elements) and on change tracking. The geo-reference data model thus defines a path towards effective interoperability of official databases at different scales.
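A minimal sketch of the record structure such a model implies: each geo-object carries a persistent unique identifier shared by municipal and national databases, plus life-cycle fields that drive automatic updates of the national database. The field names and the sync routine are assumptions for illustration, not the actual VGDB schema or service.

```python
# Minimal sketch of a geo-object record keyed by a persistent unique identifier,
# as the model described above requires; field names are illustrative, not the
# actual Lithuanian VGDB schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GeoObject:
    uid: str                        # identifier shared by municipal and national databases
    geometry_wkt: str               # geometry in WKT, e.g. "POINT(25.28 54.69)"
    valid_from: str                 # life-cycle start (ISO date)
    valid_to: Optional[str] = None  # None while the object is current

def sync_national(national: dict, municipal_updates: list) -> None:
    """Apply municipal changes to the national database keyed by uid."""
    for obj in municipal_updates:
        if obj.valid_to is not None:
            national.pop(obj.uid, None)    # object retired in its life cycle
        else:
            national[obj.uid] = obj        # insert or replace the current version

national_db = {}
sync_national(national_db, [GeoObject("LT-0001", "POINT(25.28 54.69)", "2008-01-01")])
```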


2019 ◽  
Author(s):  
Zhi-Jie Cao ◽  
Lin Wei ◽  
Shen Lu ◽  
De-Chang Yang ◽  
Ge Gao

An effective and efficient cell-querying method is critical for integrating existing scRNA-seq data and annotating new data. Herein, we present Cell BLAST, an accurate and robust cell-querying method. Powered by a well-curated reference database and a user-friendly Web server, Cell BLAST (http://cblast.gao-lab.org) provides a one-stop solution for real-world scRNA-seq cell querying and annotation.
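A generic sketch of the underlying idea of reference-based cell querying: embed reference and query cells in a shared low-dimensional space and annotate query cells by their nearest reference neighbours. This is not the Cell BLAST API; the data, embedding, and voting scheme are placeholders for illustration.

```python
# Generic sketch of reference-based cell querying by nearest neighbours in a
# low-dimensional embedding; toy data, not the Cell BLAST API.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
ref_expr = rng.poisson(1.0, size=(500, 2000)).astype(float)    # reference cells x genes
ref_labels = rng.choice(["T cell", "B cell", "monocyte"], size=500)
query_expr = rng.poisson(1.0, size=(10, 2000)).astype(float)   # new cells to annotate

pca = PCA(n_components=20).fit(np.log1p(ref_expr))             # shared embedding
nn = NearestNeighbors(n_neighbors=5).fit(pca.transform(np.log1p(ref_expr)))
_, idx = nn.kneighbors(pca.transform(np.log1p(query_expr)))

for i, hits in enumerate(idx):                                 # majority vote over hits
    labels, counts = np.unique(ref_labels[hits], return_counts=True)
    print(f"query cell {i}: {labels[counts.argmax()]}")
```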


2019 ◽  
Author(s):  
Shane L. Hubler ◽  
Praveen Kumar ◽  
Subina Mehta ◽  
Caleb Easterly ◽  
James E. Johnson ◽  
...  

Workflows for large-scale mass spectrometry (MS)-based shotgun proteomics can potentially lead to costly errors in the form of incorrect peptide spectrum matches (PSMs). To improve the robustness of these workflows, we investigated the use of the precursor mass discrepancy (PMD) to detect and filter potentially false PSMs that nonetheless have a high confidence score. We identified and addressed three cases of unexpected bias in PMD results: time of acquisition within an LC-MS run, decoy PSMs, and peptide length. We created a post-analysis Bayesian confidence measure based on score and PMD, called PMD-FDR. We tested PMD-FDR on four datasets across three types of MS-based proteomics projects: standard (single organism; reference database), proteogenomics (single organism; customized genome-based database plus reference), and metaproteomics (microbial community; customized conglomerate database). On a ground-truth dataset and other representative data, PMD-FDR detected 60-80% of likely incorrect PSMs (false hits) while losing only 5% of correct PSMs (true hits). PMD-FDR can also be used to evaluate the quality of results generated by different experimental PSM-generating workflows, assisting in method development. Going forward, PMD-FDR should provide detection of high-scoring but likely false hits, aiding applications that rely heavily on accurate PSMs, such as proteogenomics and metaproteomics.
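A simplified illustration of the quantity the method is built on: the precursor mass discrepancy in parts per million between the observed precursor and the matched peptide, with a crude outlier flag. This is not the authors' Bayesian PMD-FDR model; the tolerance and toy PSMs are assumptions.

```python
# Simplified illustration of precursor mass discrepancy (PMD) filtering;
# not the authors' Bayesian PMD-FDR model, just the underlying quantity.
import statistics

PROTON = 1.007276466  # mass of a proton in Da

def pmd_ppm(observed_mz, charge, theoretical_mass):
    """PMD in ppm between the observed precursor and the matched peptide."""
    observed_mass = observed_mz * charge - charge * PROTON   # neutral precursor mass
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

# toy PSMs: (observed m/z, charge, theoretical peptide mass in Da)
psms = [(512.7634, 2, 1023.5122), (733.8675, 2, 1465.7204), (512.8000, 2, 1023.5122)]
pmds = [pmd_ppm(mz, z, m) for mz, z, m in psms]
center = statistics.median(pmds)

for (mz, z, m), d in zip(psms, pmds):
    flag = "suspect" if abs(d - center) > 10 else "ok"   # 10 ppm tolerance, illustrative
    print(f"m/z {mz} ({z}+): PMD {d:+.1f} ppm -> {flag}")
```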


2016 ◽  
Author(s):  
Shea N Gardner ◽  
Sasha K Ames ◽  
Maya B Gokhale ◽  
Tom R Slezak ◽  
Jonathan Allen

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in the large-scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM, an amount available on many clusters, or with 16 GB of DRAM plus a 24 GB low-cost commodity flash drive (NVRAM), a cost-effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis on 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution for reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real-world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


2007 ◽  
Vol 33 (13) ◽  
pp. 1057-1059
Author(s):  
K. D. Dimakopoulos ◽  
D. G. Papageorgiou ◽  
I. N. Demetropoulos

2021 ◽  
Vol 4 ◽  
Author(s):  
Jeanine Brantschen ◽  
Rosetta Blackman ◽  
Jean-Claude Walser ◽  
Florian Altermatt

Anthropogenic activities are changing the state of ecosystems worldwide, affecting community composition and often resulting in loss of biodiversity. Riverine ecosystems are among the most impacted ecosystems. Recording their current state with regular biomonitoring is important to assess the future trajectory of biodiversity. However, traditional monitoring methods for ecological assessments are costly and time-intensive. Here, we compare environmental DNA (eDNA) to traditional kick-net sampling in a standardized framework of surface water quality assessment. We use surveys of macroinvertebrate communities to assess biodiversity and the biological state of riverine systems. Both methods were employed to monitor aquatic macroinvertebrate indicator groups at 92 sites across major Swiss river catchments. The eDNA data were taxonomically assigned using a customised reference database. All zero-radius Operational Taxonomic Units (zOTUs) mapping to one of the 142 traditionally used indicator taxon levels were used for subsequent diversity analyses (n = 205). At the site level, eDNA detected fewer indicator taxa than the kick-net method, and alpha diversity correlated only weakly between the methods. However, the methods showed strong congruence in overall community composition (gamma diversity), as the same indicator groups were commonly detected. To relate community composition to the biotic index, the ecological states of the sampling sites were predicted with a random forest approach. Using all zOTUs mapping to macroinvertebrate indicator groups (n = 693) as predictive features, the random forest models successfully predicted the ecological status of the sampled sites. The majority of the predictions (71%) resulted in the same classification as the kick-net-based scores. Thus, eDNA sampling enabled the detection of indicator communities and, combined with machine learning, provided valuable classifications of the ecological state. Overall, eDNA-based sampling has the potential to complement traditional surveys of macroinvertebrate communities in routine large-scale assessments as a non-invasive and scalable approach.
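A minimal scikit-learn sketch of the kind of model described: predicting an ecological status class from a site-by-zOTU abundance table with a random forest. The data, class labels, and hyperparameters are invented for illustration, not the study's actual pipeline.

```python
# Minimal sketch of predicting ecological status classes from a site-by-zOTU
# table with a random forest; toy data, not the study's actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.poisson(2, size=(92, 693))                 # 92 sites x 693 zOTU features
y = rng.choice(["high", "good", "moderate"], 92)   # kick-net-derived status labels

model = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(model, X, y, cv=5)        # agreement with the reference labels
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```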


2015 ◽  
Author(s):  
Peter Menzel ◽  
Kim Lee Ng ◽  
Anders Krogh

The constantly decreasing cost and increasing output of current sequencing technologies enable large-scale metagenomic studies of microbial communities from diverse habitats. Therefore, fast and accurate methods for taxonomic classification are needed, which can operate on increasingly larger datasets and reference databases. Recently, several fast metagenomic classifiers have been developed, which are based on comparison of genomic k-mers. However, nucleotide comparison using a fixed k-mer length often lacks the sensitivity to overcome the evolutionary distance between sampled species and genomes in the reference database. Here, we present the novel metagenome classifier Kaiju for fast assignment of reads to taxa. Kaiju finds maximum exact matches on the protein level using the Burrows-Wheeler transform, and can optionally allow amino acid substitutions in the search using a greedy heuristic. We show in a genome exclusion study that Kaiju can classify more reads with higher sensitivity and similar precision compared to fast k-mer-based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies more than twice as many reads in ten real metagenomes compared to programs based on genomic k-mers. Kaiju can process up to millions of reads per minute, and its memory footprint is below 6 GB of RAM, allowing analysis on a standard PC. The program is available under the GPL3 license at: http://bioinformatics-centre.github.io/kaiju
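The exact-matching core of such an approach can be illustrated with a small FM-index-style backward search over a Burrows-Wheeler transformed protein string. This is a didactic sketch of the general technique Kaiju builds on, not its actual implementation; the toy reference and pattern are invented.

```python
# Didactic sketch of Burrows-Wheeler exact matching on a protein sequence;
# illustrates the technique Kaiju builds on, not its actual implementation.
from bisect import bisect_left

def bwt_index(text):
    text += "$"                                    # sentinel smaller than amino acids
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt, sorted(text)                   # suffix array, BWT, first column

def backward_search(pattern, bwt, first_col):
    lo, hi = 0, len(bwt)                           # current suffix-array interval
    for ch in reversed(pattern):
        start = bisect_left(first_col, ch)         # count of characters smaller than ch
        lo = start + bwt[:lo].count(ch)            # LF-mapping via occurrence counts
        hi = start + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo                                 # number of exact occurrences

reference = "MSTNPKPQRKTKRNTNRRPQDVKFPGG"           # toy protein "database"
sa, bwt, first_col = bwt_index(reference)
print(backward_search("KTKRN", bwt, first_col))    # -> 1
```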


2017 ◽  
Author(s):  
Harry A. Thorpe ◽  
Sion C. Bayliss ◽  
Samuel K. Sheppard ◽  
Edward J. Feil

Despite overwhelming evidence that variation in intergenic regions (IGRs) in bacteria impacts on phenotypes, most current approaches for analysing pan-genomes focus exclusively on protein-coding sequences. To address this, we present Piggy, a novel pipeline that emulates Roary except that it is based only on IGRs. We demonstrate the use of Piggy for pan-genome analyses of Staphylococcus aureus and Escherichia coli using large genome datasets. For S. aureus, we show that highly divergent ("switched") IGRs are associated with differences in gene expression, and we establish a multi-locus reference database of IGR alleles (igMLST; implemented in BIGSdb). Piggy is available at https://github.com/harry-thorpe/piggy.
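A small sketch of how intergenic regions can be pulled out of gene coordinates on a contig, the kind of step an IGR-based pan-genome pipeline must perform. The coordinate layout, minimum length, and toy sequence are assumptions for illustration, not Piggy's input format or parameters.

```python
# Small sketch of extracting intergenic regions (IGRs) between consecutive genes
# on one contig; coordinates are illustrative, not Piggy's input format.
def intergenic_regions(contig_seq, genes, min_len=30):
    """genes: list of (start, end), 0-based, end-exclusive, assumed sorted."""
    igrs = []
    prev_end = 0
    for start, end in genes:
        if start - prev_end >= min_len:
            igrs.append((prev_end, start, contig_seq[prev_end:start]))
        prev_end = max(prev_end, end)
    if len(contig_seq) - prev_end >= min_len:      # trailing IGR after the last gene
        igrs.append((prev_end, len(contig_seq), contig_seq[prev_end:]))
    return igrs

contig = "A" * 40 + "C" * 120 + "G" * 35 + "T" * 90 + "A" * 50
genes = [(40, 160), (195, 285)]                    # two toy gene coordinates
for start, end, seq in intergenic_regions(contig, genes):
    print(f"IGR {start}-{end} ({len(seq)} bp)")
```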


2021 ◽  
Vol 4 ◽  
Author(s):  
Virginie Marques ◽  
Tristan Milhau ◽  
Camille Albouy ◽  
Tony Dejean ◽  
Stéphanie Manel ◽  
...  

Environmental DNA metabarcoding has recently emerged as a non-invasive tool for aquatic biodiversity inventories, frequently surpassing traditional methods for detecting a wide range of taxa in most habitats. One of the major limitations currently impairing the large-scale application of DNA-based inventories, such as eDNA or bulk-sample analysis, is the lack of species sequences available in public genetic databases. These gaps are still largely unknown spatially and taxonomically for most regions of the world, which can hinder targeted future sequencing efforts. We propose GAPeDNA, a user-friendly web interface that provides a global overview of genetic database completeness for a given taxon across space and conservation status. As an initial application, we synthesized data from regional checklists for marine and freshwater fishes, along with their IUCN conservation status, to provide global maps of species coverage using the European Nucleotide Archive public reference database for 19 metabarcoding primers. This tool automates the scanning of gaps in these databases to guide future sequencing efforts and support the deployment of DNA-based inventories at larger scales. It is flexible and can be expanded to other taxa and primers upon data availability. Using our global fish case study, we show that gaps increase toward the tropics, where species diversity and the number of threatened species are highest. It highlights priority areas for fish sequencing, such as the Congo, the Mekong, and the Mississippi freshwater basins, which host more than 60 non-sequenced threatened fish species. For marine fishes, the Caribbean and East Africa host up to 42 non-sequenced threatened species. As an open-access, updatable, and flexible tool, GAPeDNA can be used to evaluate the completeness of sequence reference libraries for various markers and for any taxonomic group.
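A minimal sketch of the gap report such a tool produces: compare a regional species checklist against the set of species with a reference sequence for a given marker, and report coverage plus missing threatened species. The checklist entries and sequenced set are toy examples, not GAPeDNA's data.

```python
# Minimal sketch of a reference-database gap report: compare a regional species
# checklist with the species sequenced for a given marker; inputs are invented.
checklist = {                       # species -> IUCN category (toy example)
    "Pangasianodon gigas": "CR",
    "Lates niloticus": "LC",
    "Salmo trutta": "LC",
    "Squalius cephalus": "LC",
}
sequenced = {"Salmo trutta", "Squalius cephalus"}   # species with sequences for one marker

missing = {sp: cat for sp, cat in checklist.items() if sp not in sequenced}
coverage = 1 - len(missing) / len(checklist)

print(f"coverage: {coverage:.0%}")
print("non-sequenced threatened species:",
      [sp for sp, cat in missing.items() if cat in {"VU", "EN", "CR"}])
```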

