MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs

2021 ◽  
Author(s):  
Vijini Mallawaarachchi ◽  
Yu Lin

Metagenomic binning has allowed us to study and characterize various genetic material of different species and gain insights into microbial communities. While existing binning tools bin metagenomic de novo assemblies, they do not make use of the assembly graphs that produce such assemblies. Here we propose MetaCoAG, a tool that utilizes assembly graphs with the composition and coverage information to bin metagenomic contigs. MetaCoAG uses single-copy marker genes to estimate the number of initial bins, assigns contigs into bins iteratively and adjusts the number of bins dynamically throughout the binning process. Experimental results on simulated and real datasets demonstrate that MetaCoAG significantly outperforms state-of-the-art binning tools, producing more high-quality bins than the second-best tool, with an average median F1-score of 88.40%. To the best of our knowledge, MetaCoAG is the first stand-alone binning tool to make direct use of the assembly graph information. MetaCoAG is available at https://github.com/Vini2/MetaCoAG.

2019 ◽  
Vol 10 (2) ◽  
pp. 475-478 ◽  
Author(s):  
Nicholas A. Mason ◽  
Paulo Pulgarin ◽  
Carlos Daniel Cadena ◽  
Irby J. Lovette

The Horned Lark (Eremophila alpestris) is a small songbird that exhibits remarkable geographic variation in appearance and habitat across an expansive distribution. While E. alpestris has been the focus of many ecological and evolutionary studies, we still lack a highly contiguous genome assembly for the Horned Lark and related taxa (Alaudidae). Here, we present CLO_EAlp_1.0, a highly contiguous assembly for E. alpestris generated from a blood sample of a wild, male bird captured in the Altiplano Cundiboyacense of Colombia. By combining short-insert and mate-pair libraries with the ALLPATHS-LG genome assembly pipeline, we generated a 1.04 Gb assembly comprised of 2713 scaffolds, with a largest scaffold size of 31.81 Mb, a scaffold N50 of 9.42 Mb, and a scaffold L50 of 30. These scaffolds were assembled from 23685 contigs, with a largest contig size of 1.69 Mb, a contig N50 of 193.81 kb, and a contig L50 of 1429. Our assembly pipeline also produced a single mitochondrial DNA contig of 14.00 kb. After polishing the genome, we identified 94.5% of single-copy gene orthologs from an Aves data set and 97.7% of single-copy gene orthologs from a vertebrata data set, which further demonstrates the high quality of our assembly. We anticipate that this genomic resource will be useful to the broader ornithological community and those interested in studying the evolutionary history and ecological interactions of larks, which comprise a widespread, yet understudied lineage of songbirds.
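
The N50 and L50 contiguity statistics quoted above can be illustrated with a short sketch. It assumes the common definition: sort sequences from longest to shortest, accumulate lengths until at least half of the total assembly length is covered, then report the length of the last sequence added (N50) and the number of sequences needed (L50). The function name `n50_l50` and the toy lengths are illustrative only and are not taken from the paper.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'count' sequences were needed, so L50 = count
            return length, count


# Toy assembly (hypothetical lengths, total 300)
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```

Under this definition, a larger N50 together with a smaller L50 indicates a more contiguous assembly, which is how the scaffold and contig figures above are meant to be read.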


Author(s):  
Tianxing Wu ◽  
Guilin Qi ◽  
Bin Luo ◽  
Lei Zhang ◽  
Haofen Wang

Extracting knowledge from Wikipedia has attracted much attention over the past ten years. One of the most valuable kinds of knowledge is type information, which refers to the axioms stating that an instance is of a certain type. Current approaches for inferring the types of instances from Wikipedia mainly rely on language-specific rules. Since these rules cannot capture the semantic associations between instances and classes (i.e. candidate types), they may lead to mistakes and omissions in the process of type inference. The authors propose a new approach leveraging attributes to perform language-independent type inference of the instances from Wikipedia. The proposed approach is applied to the whole English and Chinese Wikipedia, which results in the first version of MulType (Multilingual Type Information), a knowledge base describing the types of instances from multilingual Wikipedia. Experimental results show not only that the proposed approach outperforms the state-of-the-art comparison methods, but also that MulType contains much new and high-quality type information.


Author(s):  
Ziming Li ◽  
Julia Kiseleva ◽  
Maarten De Rijke

The performance of adversarial dialogue generation models relies on the quality of the reward signal produced by the discriminator. The reward signal from a poor discriminator can be very sparse and unstable, which may lead the generator to fall into a local optimum or to produce nonsense replies. To alleviate the first problem, we first extend a recently proposed adversarial dialogue generation method to an adversarial imitation learning solution. Then, in the framework of adversarial inverse reinforcement learning, we propose a new reward model for dialogue generation that can provide a more accurate and precise reward signal for generator training. We evaluate the performance of the resulting model with automatic metrics and human evaluations in two annotation settings. Our experimental results demonstrate that our model can generate more high-quality responses and achieve higher overall performance than the state-of-the-art.


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e9114 ◽  
Author(s):  
Jiawei Wang ◽  
Weizhen Liu ◽  
Dongzi Zhu ◽  
Xiang Zhou ◽  
Po Hong ◽  
...  

The sweet cherry (Prunus avium) is one of the most economically important fruit species in the world. However, there is a limited amount of genetic information available for this species, which hinders breeding efforts at a molecular level. We were able to describe a high-quality reference genome assembly and annotation of the diploid sweet cherry (2n = 2x = 16) cv. Tieton using linked-read sequencing technology. We generated over 750 million clean reads, representing 112.63 GB of raw sequencing data. The Supernova assembler produced a more highly-ordered and continuous genome sequence than the current P. avium draft genome, with a contig N50 of 63.65 KB and a scaffold N50 of 2.48 MB. The final scaffold assembly was 280.33 MB in length, representing 82.12% of the estimated Tieton genome. Eight chromosome-scale pseudomolecules were constructed, completing a 214 MB sequence of the final scaffold assembly. De novo, homology-based, and RNA-seq methods were used together to predict 30,975 protein-coding loci. 98.39% of core eukaryotic genes and 97.43% of single copy orthologues were identified in the embryo plant, indicating the completeness of the assembly. Linked-read sequencing technology was effective in constructing a high-quality reference genome of the sweet cherry, which will benefit the molecular breeding and cultivar identification in this species.


2022 ◽  
pp. 580-606
Author(s):  
Tianxing Wu ◽  
Guilin Qi ◽  
Bin Luo ◽  
Lei Zhang ◽  
Haofen Wang

Extracting knowledge from Wikipedia has attracted much attention over the past ten years. One of the most valuable kinds of knowledge is type information, which refers to the axioms stating that an instance is of a certain type. Current approaches for inferring the types of instances from Wikipedia mainly rely on language-specific rules. Since these rules cannot capture the semantic associations between instances and classes (i.e. candidate types), they may lead to mistakes and omissions in the process of type inference. The authors propose a new approach leveraging attributes to perform language-independent type inference of the instances from Wikipedia. The proposed approach is applied to the whole English and Chinese Wikipedia, which results in the first version of MulType (Multilingual Type Information), a knowledge base describing the types of instances from multilingual Wikipedia. Experimental results show not only that the proposed approach outperforms the state-of-the-art comparison methods, but also that MulType contains much new and high-quality type information.


2020 ◽  
Author(s):  
Christopher Quince ◽  
Sergey Nurk ◽  
Sebastien Raguideau ◽  
Robert James ◽  
Orkun S. Soyer ◽  
...  

We introduce a novel bioinformatics pipeline, STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, when multiple metagenome samples from the same community are available. STRONG performs coassembly, followed by binning into metagenome assembled genomes (MAGs), but uniquely it stores the coassembly graph prior to simplification of variants. This enables the subgraphs for individual single-copy core genes (SCGs) in each MAG to be extracted. It can then thread back reads from the samples to compute per sample coverages for the unitigs in these graphs. These graphs and their unitig coverages are then used in a Bayesian algorithm, BayesPaths, that determines the number of strains present, their sequences or haplotypes on the SCGs and their abundances in each of the samples. Our approach both avoids the ambiguities of read mapping and allows more of the information on co-occurrence of variants in reads to be utilised than if variants were treated independently, whilst at the same time exploiting the correlation of variants across samples that occurs when they are linked in the same strain. We compare STRONG to the current state of the art on synthetic communities and demonstrate that we can recover more strains, more accurately, and with a realistic estimate of uncertainty deriving from the variational Bayesian algorithm employed for the strain resolution. On a real anaerobic digestor time series we obtained strain-resolved SCGs for over 300 MAGs that for abundant community members match those observed from long Nanopore reads.


2021 ◽  
Author(s):  
Solize Vosloo ◽  
Linxuan Huo ◽  
Christopher L Anderson ◽  
Maria Sevillano Rivera ◽  
Zihan Dai ◽  
...  

Reconstructing microbial genomes from metagenomic short-read data can be challenging due to the unknown and uneven complexity of microbial communities. This complexity encompasses highly diverse populations, which often include strain variants. Reconstructing high-quality genomes is a crucial part of the metagenomic workflow as subsequent ecological and metabolic inferences depend on their accuracy, quality, and completeness. In contrast to microbial communities in other ecosystems, there has been no systematic assessment of genome-centric metagenomic workflows for drinking water microbiomes. In this study, we assessed the performance of a combination of assembly and binning strategies for time-series drinking water metagenomes that were collected over a period of 6 months. The goal of this study was to identify the combination of assembly and binning approaches that results in a high quality and quantity of metagenome-assembled genomes (MAGs), representing most of the sequenced metagenome. Our findings suggest that the metaSPAdes co-assembly strategies had the best performance as they resulted in larger and less fragmented assemblies with at least 85% of the sequence data mapping to contigs greater than 1 kbp. Furthermore, a combination of metaSPAdes co-assembly strategies and MetaBAT2 produced the highest number of medium-quality MAGs while capturing at least 70% of the metagenomes based on read recruitment. Utilizing different assembly/binning approaches also assists in the reconstruction of unique MAGs from closely related species that would have otherwise collapsed into a single MAG using a single workflow. Overall, our study suggests that leveraging multiple binning approaches with different metaSPAdes co-assembly strategies may be required to maximize the recovery of good-quality MAGs, which more accurately capture the microbial diversity of drinking water samples.


Author(s):  
Yu-Lun Liu ◽  
Yi-Tung Liao ◽  
Yen-Yu Lin ◽  
Yung-Yu Chuang

Video frame interpolation algorithms predict intermediate frames to produce videos with higher frame rates and smooth view transitions given two consecutive frames as inputs. We propose that synthesized frames are more reliable if they can be used to reconstruct the input frames with high quality. Based on this idea, we introduce a new loss term, the cycle consistency loss. The cycle consistency loss can better utilize the training data to not only enhance the interpolation results, but also maintain the performance better with less training data. It can be integrated into any frame interpolation network and trained in an end-to-end manner. In addition to the cycle consistency loss, we propose two extensions: motion linearity loss and edge-guided training. The motion linearity loss approximates the motion between two input frames to be linear and regularizes the training. By applying edge-guided training, we further improve results by integrating edge information into training. Both qualitative and quantitative experiments demonstrate that our model outperforms the state-of-the-art methods. The source code of the proposed method and more experimental results will be available at https://github.com/alex04072000/CyclicGen.
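
The idea behind a cycle consistency loss can be sketched in a few lines. The abstract does not spell out the formulation, so this is only one plausible reading under stated assumptions: three consecutive frames are available, the two in-between frames are synthesized, and those synthesized frames are interpolated again to rebuild the middle input frame, which is then compared with the original using a mean absolute error. The `interpolate` function is a hypothetical stand-in (a per-pixel average), not the authors' network, and frames are flat lists of pixel values for simplicity.

```python
def interpolate(frame_a, frame_b):
    """Stand-in interpolator: per-pixel average (NOT the paper's network)."""
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]


def cycle_consistency_loss(frame0, frame1, frame2):
    """Mean absolute error between frame1 and its reconstruction
    from the two synthesized in-between frames (assumed formulation)."""
    mid01 = interpolate(frame0, frame1)   # synthesized frame between 0 and 1
    mid12 = interpolate(frame1, frame2)   # synthesized frame between 1 and 2
    rebuilt1 = interpolate(mid01, mid12)  # should land back on frame1
    return sum(abs(r - t) for r, t in zip(rebuilt1, frame1)) / len(frame1)


# Pixel values change linearly across frames: the cycle closes exactly.
print(cycle_consistency_loss([0.0, 0.0], [2.0, 4.0], [4.0, 8.0]))  # 0.0
# Non-linear change: the reconstruction misses the middle frame.
print(cycle_consistency_loss([0.0], [2.0], [10.0]))  # 1.5
```

With a learned interpolator in place of the averaging stand-in, a term like this would be added to the training objective so that the network is penalized when its own outputs cannot be used to recover a real input frame.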


2020 ◽  
Author(s):  
C. Molitor ◽  
T.J. Kurowski ◽  
P.M. Fidalgo de Almeida ◽  
P. Eerolla ◽  
D.J. Spindlow ◽  
...  

Solanum sitiens is a self-incompatible wild relative of tomato, characterised by salt and drought resistance traits, with the potential to contribute to crop improvement in cultivated tomato. This species has a distinct morphology, classification and ecotype compared to other stress resistant wild tomato relatives such as S. pennellii and S. chilense. Therefore, the availability of a high-quality reference genome for S. sitiens will facilitate the genetic and molecular understanding of salt and drought resistance. Here, we present a de novo genome and transcriptome assembly for S. sitiens (Accession LA1974). A hybrid assembly strategy was followed using Illumina short reads (∼159X coverage) and PacBio long reads (∼44X coverage), generating a total of ∼262 Gbp of DNA sequence; in addition, ∼2,670 Gbp of BioNano data was obtained. A reference genome of 1,245 Mbp, arranged in 1,481 scaffolds with a N50 of 1,826 Mbp was generated. Genome completeness was estimated at 95% using the Benchmarking Universal Single-Copy Orthologs (BUSCO) and the K-mer Analysis Tool (KAT); this is within the range of current high-quality reference genomes for other tomato wild relatives. Additionally, we identified three large inversions compared to S. lycopersicum, containing several drought resistance related genes, such as beta-amylase 1 and YUCCA7. In addition, ∼63 Gbp of RNA-Seq were generated to support the prediction of 31,164 genes from the assembly, and perform a de novo transcriptome. Some of the protein clusters unique to S. sitiens were associated with genes involved in drought and salt resistance, including GLO1 and FQR1. This first reference genome for S. sitiens will provide a valuable resource to progress QTL studies to the gene level, and will assist molecular breeding to improve crop production in water-limited environments.


2020 ◽  
Vol 34 (05) ◽  
pp. 8042-8049
Author(s):  
Tomoyuki Kajiwara ◽  
Biwa Miura ◽  
Yuki Arase

We tackle the low-resource problem in style transfer by employing transfer learning that utilizes abundantly available raw corpora. Our method consists of two steps: pre-training learns to generate a sentence that is semantically equivalent to the input while ensuring grammaticality, and fine-tuning learns to add a desired style. Pre-training has two options, auto-encoding and machine translation based methods. Pre-training based on AutoEncoder is a simple way to learn these from a raw corpus. If machine translators are available, the model can learn more diverse paraphrasing via roundtrip translation. After these, fine-tuning achieves high-quality paraphrase generation even in situations where only 1k sentence pairs of the parallel corpus for style transfer are available. Experimental results on formality style transfer indicate the effectiveness of both pre-training methods, and the method based on roundtrip translation achieves state-of-the-art performance.

