Biases in genome reconstruction from metagenomic data

PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10119
Author(s):  
William C. Nelson ◽  
Benjamin J. Tully ◽  
Jennifer M. Mobberley

Background Advances in sequencing, assembly, and assortment of contigs into species-specific bins have enabled the reconstruction of genomes from metagenomic data (MAGs). Though a powerful technique, it is difficult to determine whether assembly and binning techniques are accurate when applied to environmental metagenomes due to a lack of complete reference genome sequences against which to check the resulting MAGs. Methods We compared MAGs derived from an enrichment culture containing ~20 organisms to complete genome sequences of 10 organisms isolated from the enrichment culture. Factors commonly considered in binning software—nucleotide composition and sequence repetitiveness—were calculated for both the correctly binned and not-binned regions. This direct comparison revealed biases in sequence characteristics and gene content in the not-binned regions. Additionally, the composition of three public data sets representing MAGs reconstructed from the Tara Oceans metagenomic data was compared to a set of representative genomes available through NCBI RefSeq to verify that the biases identified were observable in more complex data sets and using three contemporary binning software packages. Results Repeat sequences were frequently not binned in the genome reconstruction processes, as were sequence regions with variant nucleotide composition. Genes encoded on the not-binned regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions and genes of unknown function. Our results support genome reconstruction as a robust process and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function; however, population-level genotypic heterogeneity in natural populations, such as uneven distribution of plasmids, can lead to incorrect inferences.


2017 ◽  
Author(s):  
William C Nelson ◽  
Jennifer M Mobberley

Background: Technological advances in sequencing, assembly and segregation of resulting contigs into species-specific bins have enabled the reconstruction of individual genomes from environmental metagenomic data sets. Though a powerful technique, it is shadowed by an inability to truly determine whether assembly and binning techniques are accurate, specific, and sensitive due to a lack of complete reference genome sequences against which to check the data. Errors in genome reconstruction, such as missing or mis-attributed activities, can have a detrimental effect on downstream metabolic and ecological modeling, and thus it is important to assess the accuracy of the process. Methods: We compared genomes reconstructed from metagenomic data to complete genome sequences of 10 organisms isolated from the same community to identify regions not captured by typical binning techniques. The nucleotide content, as %G+C and tetranucleotide frequencies, and sequence redundancy within both the genome and across the metagenome were determined for both the captured and uncaptured regions. This direct comparison allowed us to evaluate the efficacy of nucleotide composition and coverage profiles as elements of binning protocols and look for biases in sequence characteristics and gene content in regions missing from the reconstructions. Results: We found that repeated sequences were frequently missed in the reconstruction process, as were short sequences with variant nucleotide composition. Genes encoded on the missing regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions and genes of unknown function. Conclusions: Our observation of increased mis-binning of short regions, especially those with variant nucleotide content, and repeated regions implies that factors that affect assembly efficiency also impact binning accuracy. To a large extent, mis-binned regions appear to derive from mobile elements.
Our results support genome reconstruction as a robust process, and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function.
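The composition signals described above (%G+C and tetranucleotide frequencies) are straightforward to compute per contig. Below is a minimal Python sketch of these two metrics; the sequences used are toy examples, not data from the study:

```python
from collections import Counter
from itertools import product

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def tetranucleotide_freqs(seq):
    """Frequencies of all 256 possible overlapping 4-mers, normalized
    by the number of 4-mer windows so the values sum to 1."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(len(seq) - 3, 1)
    return {"".join(kmer): counts["".join(kmer)] / total
            for kmer in product("ACGT", repeat=4)}
```

Binning tools typically compare such 256-dimensional tetranucleotide profiles between contigs and group those whose composition is similar, which is why regions with atypical composition (e.g., recently acquired mobile elements) tend to be left out of bins.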


2019 ◽  
Vol 8 (23) ◽  
Author(s):  
Ignacio de la Higuera ◽  
Ellis L. Torrance ◽  
Alyssa A. Pratt ◽  
George W. Kasun ◽  
Amberlee Maluenda ◽  
...  

Cruciviruses are single-stranded DNA (ssDNA) viruses whose genomes suggest the possibility of gene transfer between DNA and RNA viruses. Many crucivirus genome sequences have been found in metagenomic data sets, although no crucivirus has been isolated.


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6695 ◽  
Author(s):  
Andrea Garretto ◽  
Thomas Hatzopoulos ◽  
Catherine Putonti

Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments.
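virMine's elimination strategy—retaining sequences that fail to match non-viral references rather than requiring a positive match to known viruses—can be illustrated with a toy k-mer filter. This is a simplified sketch, not virMine's actual pipeline (which works on assembled contigs with alignment-based scoring); the k-mer size, threshold, and sequences are invented for illustration:

```python
def kmers(seq, k=8):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_viral(contigs, nonviral_refs, k=8, max_overlap=0.2):
    """Keep contigs whose k-mer overlap with non-viral references is low.

    Illustrates classification by elimination: a contig is retained as a
    viral candidate when it cannot be attributed to the non-viral set,
    so novel viruses are not discarded for lacking known relatives.
    """
    ref_kmers = set()
    for ref in nonviral_refs:
        ref_kmers |= kmers(ref, k)
    kept = []
    for contig in contigs:
        cks = kmers(contig, k)
        overlap = len(cks & ref_kmers) / max(len(cks), 1)
        if overlap <= max_overlap:
            kept.append(contig)
    return kept
```

The design choice mirrors the abstract's point: a positive-match filter can only find relatives of known viruses, whereas an elimination filter leaves novel species in the candidate pool.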


2020 ◽  
Vol 9 (21) ◽  
Author(s):  
Emily Wei-Hsin Sun ◽  
Sassan Hajirezaie ◽  
Mackenzie Dooner ◽  
Tatiana A. Vishnivetskaya ◽  
Alice Layton ◽  
...  

The role of archaeal ammonia oxidizers often exceeds that of bacterial ammonia oxidizers in marine and terrestrial environments but has been understudied in permafrost, where thawing has the potential to release ammonia. Here, three thaumarchaeal genomes were assembled and annotated from metagenomic data sets from carbon-poor Canadian High Arctic active-layer cryosols.


2021 ◽  
Vol 16 (1) ◽  
pp. 1-24
Author(s):  
Yaojin Lin ◽  
Qinghua Hu ◽  
Jinghua Liu ◽  
Xingquan Zhu ◽  
Xindong Wu

In multi-label learning, label correlations commonly exist in the data. Such correlation not only provides useful information, but also imposes significant challenges for multi-label learning. Recently, label-specific feature embedding has been proposed to explore label-specific features from the training data, and to use features highly customized to the multi-label set for learning. While such feature embedding methods have demonstrated good performance, the creation of the feature embedding space is based only on a single label, without considering label correlations in the data. In this article, we propose to combine multiple label-specific feature spaces, using label correlation, for multi-label learning. The proposed algorithm, multi-label-specific feature space ensemble (MULFE), takes into consideration label-specific features, label correlation, and a weighted ensemble principle to form a learning framework. By conducting clustering analysis on each label's negative and positive instances, MULFE first creates features customized to each label. After that, MULFE utilizes the label correlation to optimize the margin distribution of the base classifiers which are induced by the related label-specific feature spaces. By combining multiple label-specific features, label-correlation-based weighting, and ensemble learning, MULFE achieves the maximum-margin multi-label classification goal through the underlying optimization framework. Empirical studies on 10 public data sets demonstrate the effectiveness of MULFE.
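The label-specific feature construction described in the abstract can be sketched in NumPy. MULFE clusters each label's positive and negative instances; the simplified stand-in below uses a single centroid per class (k = 1) and maps every instance to its distances from the two centroids. It illustrates the idea of per-label feature spaces, not the published algorithm:

```python
import numpy as np

def label_specific_features(X, Y):
    """Build one feature space per label from class centroids.

    X: (n_samples, n_features) instance matrix.
    Y: (n_samples, n_labels) binary label matrix.
    Returns a list of (n_samples, 2) arrays; column 0 is the distance
    to the label's positive centroid, column 1 to its negative centroid.
    """
    spaces = []
    for j in range(Y.shape[1]):
        pos = X[Y[:, j] == 1]
        neg = X[Y[:, j] == 0]
        # Fall back to the global centroid if a class is empty.
        c_pos = pos.mean(axis=0) if len(pos) else X.mean(axis=0)
        c_neg = neg.mean(axis=0) if len(neg) else X.mean(axis=0)
        d_pos = np.linalg.norm(X - c_pos, axis=1)
        d_neg = np.linalg.norm(X - c_neg, axis=1)
        spaces.append(np.stack([d_pos, d_neg], axis=1))
    return spaces
```

A per-label base classifier would then be trained on each of these spaces, and MULFE's contribution is to weight and combine those classifiers using the label correlations.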


GigaScience ◽  
2021 ◽  
Vol 10 (2) ◽  
Author(s):  
Guilhem Sempéré ◽  
Adrien Pétel ◽  
Magsen Abbé ◽  
Pierre Lefeuvre ◽  
Philippe Roumagnac ◽  
...  

Background Efficiently managing large, heterogeneous data in a structured yet flexible way is a challenge to research laboratories working with genomic data. Specifically regarding both shotgun- and metabarcoding-based metagenomics, while online reference databases and user-friendly tools exist for running various types of analyses (e.g., Qiime, Mothur, Megan, IMG/VR, Anvi'o, Qiita, MetaVir), scientists lack comprehensive software for easily building scalable, searchable, online data repositories on which they can rely during their ongoing research. Results metaXplor is a scalable, distributable, fully web-interfaced application for managing, sharing, and exploring metagenomic data. Being based on a flexible NoSQL data model, it has few constraints regarding dataset contents and thus proves useful for handling outputs from both shotgun and metabarcoding techniques. By supporting incremental data feeding and providing means to combine filters on all imported fields, it allows for exhaustive content browsing, as well as rapid narrowing to find specific records. The application also features various interactive data visualization tools, ways to query contents by BLASTing external sequences, and an integrated pipeline to enrich assignments with phylogenetic placements. The project home page provides the URL of a live instance allowing users to test the system on public data. Conclusion metaXplor allows efficient management and exploration of metagenomic data. Its availability as a set of Docker containers, making it easy to deploy on academic servers, on the cloud, or even on personal computers, will facilitate its adoption.


2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure that those results can be reproduced and compared with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following in-demand functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licensing issues and user rights. To demonstrate the proposed functionality, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at https://rdata.4spam.group to facilitate understanding of this study.


2014 ◽  
Vol 104 (10) ◽  
pp. 1125-1129 ◽  
Author(s):  
A. H. Stobbe ◽  
W. L. Schneider ◽  
P. R. Hoyt ◽  
U. Melcher

Next generation sequencing (NGS) is not commonly used in diagnostics, in part due to the large amount of time and computational power needed to identify the taxonomic origin of each sequence in an NGS data set. By using the unassembled NGS data sets as the target for searches, pathogen-specific sequences, termed e-probes, could be used as queries to enable detection of specific viruses or organisms in plant sample metagenomes. This method, designated e-probe diagnostic nucleic acid assay, was first tested with mock sequence databases and then with NGS data sets generated from plants infected with a DNA (Bean golden yellow mosaic virus, BGYMV) or an RNA (Plum pox virus, PPV) virus. In addition, the ability to detect and differentiate among strains of a single virus species, PPV, was examined by using strain-specific probe sets. The use of probe sets for multiple viruses determined that one sample was dually infected with BGYMV and Bean golden mosaic virus.
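The core idea—querying unassembled reads directly with pathogen-specific probe sequences instead of taxonomically classifying every read—can be illustrated with a toy exact-match search. The published assay uses sequence-similarity searches rather than exact substring matching, and the probe sequences below are invented for illustration:

```python
def probe_hits(reads, probes, min_match=1):
    """Count reads containing each probe and report detected targets.

    reads: iterable of read strings.
    probes: dict mapping target name -> probe sequence.
    A target is 'detected' when at least min_match reads contain
    its probe as an exact substring (a simplification of the
    alignment-based searches used by the real assay).
    """
    hits = {name: 0 for name in probes}
    for read in reads:
        for name, probe in probes.items():
            if probe in read:
                hits[name] += 1
    detected = {name for name, n in hits.items() if n >= min_match}
    return hits, detected
```

Because only the probe set is searched, runtime scales with the number of probes rather than the size of the taxonomy database, which is the speed advantage the abstract describes.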


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image-processing technologies based on convolutional neural networks involve heavy computation, high storage cost and low accuracy for tiny-defect detection, which conflicts with the high real-time performance and accuracy that industrial applications require under limited computing and storage resources. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression of the YOLOv4 feature extraction network to simplify the model and improves the model's feature extraction ability through knowledge distillation. On the other hand, a prediction scale with a more detailed receptive field is added to optimize the model structure, which improves detection performance for tiny defects. Findings The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007, and on a steel ingot data set collected in an actual industrial setting. The experimental results demonstrate that the proposed YOLOv4-Defect method can greatly improve recognition efficiency and accuracy while reducing the size and computational cost of the model. Originality/value This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection, which is conducive to application in industrial scenarios with limited storage and computing resources and meets the requirements of high real-time performance and precision.
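The knowledge-distillation step mentioned above trains the compressed student network to match the larger teacher's temperature-softened output distribution. A generic NumPy sketch of that soft-target loss follows; this is the standard formulation, not necessarily the paper's exact loss, and the temperature value is an assumption:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student outputs.

    Scaled by T^2 so its gradient magnitude stays comparable to the
    hard-label loss it is usually combined with.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)
```

A higher temperature exposes the teacher's relative confidences across wrong classes ("dark knowledge"), which is what lets a much smaller detector backbone recover most of the teacher's accuracy.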

