Biases in genome reconstruction from metagenomic data

2017 ◽  
Author(s):  
William C Nelson ◽  
Jennifer M Mobberley

Background: Technological advances in sequencing, assembly and segregation of resulting contigs into species-specific bins have enabled the reconstruction of individual genomes from environmental metagenomic data sets. Though powerful, the technique is limited by an inability to determine whether assembly and binning are accurate, specific, and sensitive, owing to a lack of complete reference genome sequences against which to check the data. Errors in genome reconstruction, such as missing or mis-attributed activities, can have a detrimental effect on downstream metabolic and ecological modeling, and thus it is important to assess the accuracy of the process. Methods: We compared genomes reconstructed from metagenomic data to complete genome sequences of 10 organisms isolated from the same community to identify regions not captured by typical binning techniques. The nucleotide content, as %G+C and tetranucleotide frequencies, and sequence redundancy both within the genome and across the metagenome were determined for the captured and uncaptured regions. This direct comparison allowed us to evaluate the efficacy of nucleotide composition and coverage profiles as elements of binning protocols and to look for biases in sequence characteristics and gene content in regions missing from the reconstructions. Results: We found that repeated sequences were frequently missed in the reconstruction process, as were short sequences with variant nucleotide composition. Genes encoded on the missing regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions and genes of unknown function. Conclusions: Our observation of increased mis-binning of short regions, especially those with variant nucleotide content, and of repeated regions implies that factors which affect assembly efficiency also impact binning accuracy. To a large extent, mis-binned regions appear to derive from mobile elements. Our results support genome reconstruction as a robust process, and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function.
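
The two composition measures named in the Methods, %G+C and tetranucleotide frequencies, can be illustrated with a minimal sketch. The function names and the handling of ambiguous bases are assumptions made here for illustration; the abstract does not say how the authors computed these values.

```python
from collections import Counter
from itertools import product


def gc_percent(seq: str) -> float:
    """Return %G+C over the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    acgt = gc + sum(seq.count(b) for b in "AT")
    return 100.0 * gc / acgt if acgt else 0.0


def tetranucleotide_freqs(seq: str) -> dict[str, float]:
    """Return relative frequencies of all 256 overlapping 4-mers in seq.

    Assumption: windows containing anything other than A, C, G, or T
    are skipped rather than resolved.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}
```

Comparing these values between a captured and an uncaptured region of the same genome is one simple way to see whether the uncaptured region has variant composition.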


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10119
Author(s):  
William C. Nelson ◽  
Benjamin J. Tully ◽  
Jennifer M. Mobberley

Background: Advances in sequencing, assembly, and assortment of contigs into species-specific bins have enabled the reconstruction of genomes from metagenomic data (MAGs). Though a powerful technique, it is difficult to determine whether assembly and binning techniques are accurate when applied to environmental metagenomes due to a lack of complete reference genome sequences against which to check the resulting MAGs. Methods: We compared MAGs derived from an enrichment culture containing ~20 organisms to complete genome sequences of 10 organisms isolated from the enrichment culture. Factors commonly considered in binning software (nucleotide composition and sequence repetitiveness) were calculated for both the correctly binned and not-binned regions. This direct comparison revealed biases in sequence characteristics and gene content in the not-binned regions. Additionally, the composition of three public data sets representing MAGs reconstructed from the Tara Oceans metagenomic data was compared to a set of representative genomes available through NCBI RefSeq to verify that the biases identified were observable in more complex data sets and using three contemporary binning software packages. Results: Repeat sequences were frequently not binned in the genome reconstruction processes, as were sequence regions with variant nucleotide composition. Genes encoded on the not-binned regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions and genes of unknown function. Our results support genome reconstruction as a robust process and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function; however, population-level genotypic heterogeneity in natural populations, such as uneven distribution of plasmids, can lead to incorrect inferences.
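
Sequence repetitiveness, the second factor named in the Methods, could be approximated as the share of k-mers in a region that occur more than once in the whole genome. This is a simplified sketch: the function name, the default k of 20, and the forward-strand-only counting are assumptions, and the abstract does not describe the measure the authors actually used.

```python
from collections import Counter


def repeat_fraction(genome: str, start: int, end: int, k: int = 20) -> float:
    """Fraction of k-mers in genome[start:end] that occur more than once in genome.

    Assumption: only the forward strand is counted; reverse complements
    are ignored.
    """
    genome = genome.upper()
    # Count every overlapping k-mer across the whole genome.
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    # Collect the k-mers that lie entirely inside the requested region.
    region = [genome[i:i + k] for i in range(start, min(end, len(genome)) - k + 1)]
    if not region:
        return 0.0
    return sum(counts[kmer] > 1 for kmer in region) / len(region)
```

A region made up mostly of repeated k-mers scores close to 1, while a unique region scores close to 0.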


2019 ◽  
Vol 8 (23) ◽  
Author(s):  
Ignacio de la Higuera ◽  
Ellis L. Torrance ◽  
Alyssa A. Pratt ◽  
George W. Kasun ◽  
Amberlee Maluenda ◽  
...  

Cruciviruses are single-stranded DNA (ssDNA) viruses whose genomes suggest the possibility of gene transfer between DNA and RNA viruses. Many crucivirus genome sequences have been found in metagenomic data sets, although no crucivirus has been isolated.


2020 ◽  
Vol 9 (21) ◽  
Author(s):  
Emily Wei-Hsin Sun ◽  
Sassan Hajirezaie ◽  
Mackenzie Dooner ◽  
Tatiana A. Vishnivetskaya ◽  
Alice Layton ◽  
...  

ABSTRACT The role of archaeal ammonia oxidizers often exceeds that of bacterial ammonia oxidizers in marine and terrestrial environments but has been understudied in permafrost, where thawing has the potential to release ammonia. Here, three thaumarchaea genomes were assembled and annotated from metagenomic data sets from carbon-poor Canadian High Arctic active-layer cryosols.


2015 ◽  
Vol 112 (50) ◽  
pp. 15450-15455 ◽  
Author(s):  
Mallory Embree ◽  
Joanne K. Liu ◽  
Mahmoud M. Al-Bassam ◽  
Karsten Zengler

Microorganisms form diverse communities that have a profound impact on the environment and human health. Recent technological advances have enabled elucidation of community diversity at high resolution. Investigation of microbial communities has revealed that they often contain multiple members with complementing and seemingly redundant metabolic capabilities. An understanding of the communal impacts of redundant metabolic capabilities is currently lacking; specifically, it is not known whether metabolic redundancy will foster competition or motivate cooperation. By investigating methanogenic populations, we identified the multidimensional interspecies interactions that define composition and dynamics within syntrophic communities that play a key role in the global carbon cycle. Species-specific genomes were extracted from metagenomic data using differential coverage binning. We used metabolic modeling leveraging metatranscriptomic information to reveal and quantify a complex intertwined system of syntrophic relationships. Our results show that amino acid auxotrophies create additional interdependencies that define community composition and control carbon and energy flux through the system while simultaneously contributing to overall community robustness. Strategic use of antimicrobials further reinforces this intricate interspecies network. Collectively, our study reveals the multidimensional interactions in syntrophic communities that promote high species richness and bolster community stability during environmental perturbations.
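
The idea behind differential coverage binning is that contigs from the same genome should show similar sequencing depth in each of two or more samples. The sketch below groups contigs greedily on log-scaled coverage from two samples. The function name, the tolerance, and the greedy grouping rule are all assumptions for illustration and are not the procedure used in the study.

```python
import math


def bin_by_coverage(coverage: dict[str, tuple[float, float]],
                    tol: float = 0.3) -> list[list[str]]:
    """Greedily group contigs whose log10 coverage in two samples is similar.

    coverage maps contig name -> (mean coverage in sample A, in sample B).
    A contig joins the first bin whose founding contig lies within `tol`
    (Euclidean distance in log10 space); otherwise it founds a new bin.
    """
    bins: list[tuple[tuple[float, float], list[str]]] = []
    for name, (a, b) in sorted(coverage.items()):
        # Add 1 before taking the log so zero coverage stays defined.
        point = (math.log10(a + 1), math.log10(b + 1))
        for centre, members in bins:
            if math.dist(point, centre) <= tol:
                members.append(name)
                break
        else:
            bins.append((point, [name]))
    return [members for _, members in bins]
```

For example, contigs with coverage near (100, 10) and contigs with coverage near (10, 100) would fall into two separate bins, since their abundance differs in opposite directions between the samples.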


Pathogens ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 86
Author(s):  
Erin M. Garcia ◽  
Myrna G. Serrano ◽  
Laahirie Edupuganti ◽  
David J. Edwards ◽  
Gregory A. Buck ◽  
...  

Gardnerella vaginalis has recently been split into 13 distinct species. In this study, we tested the hypotheses that species-specific variations in the vaginolysin (VLY) amino acid sequence could influence the interaction between the toxin and vaginal epithelial cells and that VLY variation may be one factor that distinguishes less virulent or commensal strains from more virulent strains. This was assessed by bioinformatic analyses of publicly available Gardnerella spp. sequences and quantification of cytotoxicity and cytokine production from purified, recombinantly produced versions of VLY. After identifying conserved differences that could distinguish distinct VLY types, we analyzed metagenomic data from a cohort of female subjects from the Vaginal Human Microbiome Project to investigate whether these different VLY types exhibited any significant associations with symptoms or Gardnerella spp.-relative abundance in vaginal swab samples. While Type 1 VLY was most prevalent among the subjects and may be associated with increased reports of symptoms, subjects with Type 2 VLY dominant profiles exhibited increased relative Gardnerella spp. abundance. Our findings suggest that amino acid differences alter the interaction of VLY with vaginal keratinocytes, which may potentiate differences in bacterial vaginosis (BV) immunopathology in vivo.


Genetics ◽  
2001 ◽  
Vol 159 (3) ◽  
pp. 1191-1199
Author(s):  
Araxi O Urrutia ◽  
Laurence D Hurst

Abstract In numerous species, from bacteria to Drosophila, evidence suggests that selection acts even on synonymous codon usage: codon bias is greater in more abundantly expressed genes, the rate of synonymous evolution is lower in genes with greater codon bias, and there is consistency between genes in the same species in which codons are preferred. In contrast, in mammals, while nonequal use of alternative codons is observed, the bias is attributed to the background variance in nucleotide concentrations, reflected in the similar nucleotide composition of flanking noncoding and exonic third sites. However, a systematic examination of the covariants of codon usage controlling for background nucleotide content has yet to be performed. Here we present a new method to measure codon bias that corrects for background nucleotide content and apply this to 2396 human genes. Nearly all (99%) exhibit a higher amount of codon bias than expected by chance. The patterns associated with selectively driven codon bias are weakly recovered: Broadly expressed genes have a higher level of bias than do tissue-specific genes, the bias is higher for genes with lower rates of synonymous substitutions, and certain codons are repeatedly preferred. However, while these patterns are suggestive, the first two patterns appear to be methodological artifacts. The last pattern reflects in part biases in usage of nucleotide pairs. We conclude that we find no evidence for selection on codon usage in humans.


2021 ◽  
pp. 1-13
Author(s):  
Yikai Zhang ◽  
Yong Peng ◽  
Hongyu Bian ◽  
Yuan Ge ◽  
Feiwei Qin ◽  
...  

Concept factorization (CF) is an effective matrix factorization model which has been widely used in many applications. In CF, the linear combination of data points serves as the dictionary, so CF can be performed in both the original feature space and the reproducing kernel Hilbert space (RKHS). The conventional CF treats each dimension of the feature vector equally during the data reconstruction process, which might violate the common sense that different features have different discriminative abilities and therefore contribute differently in pattern recognition. In this paper, we introduce an auto-weighting variable into the conventional CF objective function to adaptively learn the corresponding contributions of different features and propose a new model termed Auto-Weighted Concept Factorization (AWCF). In AWCF, on one hand, the feature importance can be quantitatively measured by the auto-weighting variable, in which the features with better discriminative abilities are assigned larger weights; on the other hand, we can obtain a more efficient data representation to depict its semantic information. The detailed optimization procedure for the AWCF objective function is derived, and its complexity and convergence are also analyzed. Experiments are conducted on both synthetic and representative benchmark data sets, and the clustering results demonstrate the effectiveness of AWCF in comparison with the related models.


2014 ◽  
Vol 104 (10) ◽  
pp. 1125-1129 ◽  
Author(s):  
A. H. Stobbe ◽  
W. L. Schneider ◽  
P. R. Hoyt ◽  
U. Melcher

Next-generation sequencing (NGS) is not used commonly in diagnostics, in part due to the large amount of time and computational power needed to identify the taxonomic origin of each sequence in an NGS data set. By using the unassembled NGS data sets as the target for searches, pathogen-specific sequences, termed e-probes, could be used as queries to enable detection of specific viruses or organisms in plant sample metagenomes. This method, designated e-probe diagnostic nucleic acid assay, was first tested with mock sequence databases and then with NGS data sets generated from plants infected with a DNA (Bean golden yellow mosaic virus, BGYMV) or an RNA (Plum pox virus, PPV) virus. In addition, the ability to detect and differentiate among strains of a single virus species, PPV, was examined by using probe sets that were specific to strains. The use of probe sets for multiple viruses determined that one sample was dually infected with BGYMV and Bean golden mosaic virus.
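
The query direction described above, with short pathogen-specific probes searched against unassembled reads, can be sketched in a few lines. Here exact substring matching on either strand stands in for whatever sequence search the assay actually uses. The function names and the matching rule are assumptions for illustration only.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def probes_hit(probes: dict[str, str], reads: list[str]) -> set[str]:
    """Return names of probes found verbatim, on either strand, in any read.

    Assumption: a hit means an exact substring match; no mismatches or
    gaps are tolerated.
    """
    reads = [r.upper() for r in reads]
    hits = set()
    for name, probe in probes.items():
        forward = probe.upper()
        reverse = reverse_complement(forward)
        if any(forward in r or reverse in r for r in reads):
            hits.add(name)
    return hits
```

The number of distinct probes hit for each pathogen could then feed a detection call, with the threshold chosen by the user.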


2017 ◽  
Vol 5 (11) ◽  
Author(s):  
Catherine M. Mageeney ◽  
Cimrin Bhalla ◽  
Charles A. Bowman ◽  
Bhavishya Devireddy ◽  
Adrienne P. Dzurick ◽  
...  

ABSTRACT Jane and Sneeze are newly isolated phages of Mycobacterium smegmatis mc²155 from Hillsborough, NJ, and Palo Verde, Costa Rica, respectively. Both are cluster G, subcluster G1 mycobacteriophages. Notable nucleotide differences exist between genomes in the right half, including the presence of mycobacteriophage mobile element 1 (MPME1) in Jane.

