Extraction of near-complete genomes from metagenomic samples: a new service in PATRIC

Abstract Background: Large volumes of metagenomic samples are being processed and submitted to PATRIC for analysis as reads or assembled contigs. Effective analysis of these samples requires solutions to a number of problems, including the binning of assembled, mixed, metagenomically-derived contigs into taxonomic units. Description: The PATRIC metagenome binning service utilizes the PATRIC database to furnish a large, diverse set of reference genomes. Reference genomes are assigned based on the presence of single-copy universal marker proteins in the sample, and contigs are assigned to the bin corresponding to the most similar reference genome. Each set of binned contigs represents a draft genome that will be annotated by RASTtk in PATRIC. A structured-language binning report is provided containing quality measurements and taxonomic information about the contig bins. Conclusion: We provide a new service for rapid and interpretable metagenomic contig binning and annotation in PATRIC.


Supervised extraction of near-complete genomes from metagenomic samples: A new service in PATRIC

PLoS ONE ◽

10.1371/journal.pone.0250092 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0250092

Author(s):

Bruce Parrello ◽

Rory Butler ◽

Philippe Chlenski ◽

Gordon D. Pusch ◽

Ross Overbeek

Keyword(s):

Draft Genome ◽

Single Copy ◽

Metagenomic Data ◽

Use Case ◽

High Quality ◽

Complete Genomes ◽

Downstream Analysis ◽

Low Coverage ◽

Derived Data ◽

Reference Genomes

Large amounts of metagenomically-derived data are submitted to PATRIC for analysis. In the future, we expect even more jobs submitted to PATRIC will use metagenomic data. One in-demand use case is the extraction of near-complete draft genomes from assembled contigs of metagenomic origin. The PATRIC metagenome binning service utilizes the PATRIC database to furnish a large, diverse set of reference genomes. We provide a new service for supervised extraction and annotation of high-quality, near-complete genomes from metagenomically-derived contigs. Reference genomes are assigned to putative draft genome bins based on the presence of single-copy universal marker roles in the sample, and contigs are sorted into these bins by their similarity to reference genomes in PATRIC. Each set of binned contigs represents a draft genome that will be annotated by RASTtk in PATRIC. A structured-language binning report is provided containing quality measurements and taxonomic information about the contig bins. The PATRIC metagenome binning service emphasizes extraction of high-quality genomes for downstream analysis using other PATRIC tools and services. Due to its supervised nature, the binning service is not appropriate for mining novel or extremely low-coverage genomes from metagenomic samples.
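
The contig-sorting step described above can be illustrated with a minimal sketch. It is not PATRIC's implementation: the use of Jaccard similarity over k-mer sets, the function names `kmers`, `jaccard` and `bin_contigs`, and the `min_similarity` cutoff are all assumptions chosen for illustration. The sketch only shows the general idea that each contig goes to the bin of its most similar reference genome and that contigs resembling no reference stay unbinned.

```python
from typing import Dict, Optional, Set


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bin_contigs(
    contigs: Dict[str, str],
    references: Dict[str, str],
    k: int = 4,
    min_similarity: float = 0.1,  # hypothetical cutoff, not a PATRIC parameter
) -> Dict[str, Optional[str]]:
    """Assign each contig to the most similar reference, or None if too dissimilar."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    assignment: Dict[str, Optional[str]] = {}
    for contig_id, seq in contigs.items():
        contig_kmers = kmers(seq, k)
        best_name, best_score = None, 0.0
        for name, ref_set in ref_kmers.items():
            score = jaccard(contig_kmers, ref_set)
            if score > best_score:
                best_name, best_score = name, score
        # Contigs below the cutoff are left unbinned.
        assignment[contig_id] = best_name if best_score >= min_similarity else None
    return assignment


references = {"refA": "ACGTACGTACGTACGT", "refB": "TTTTGGGGCCCCAAAA"}
contigs = {"c1": "ACGTACGTAC", "c2": "GGGGCCCCAA", "c3": "ATATATATAT"}
bins = bin_contigs(contigs, references)
```

In this toy input, `c3` shares no k-mers with either reference and so stays unbinned, which mirrors the abstract's caveat that a supervised approach is not suited to mining novel genomes.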


Solyntus, the New Highly Contiguous Reference Genome for Potato (Solanum tuberosum)

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401550 ◽

2020 ◽

Vol 10 (10) ◽

pp. 3489-3495

Author(s):

Natascha van Lieshout ◽

Ate van der Burgt ◽

Michiel E. de Vries ◽

Menno ter Maat ◽

David Eickholt ◽

...

Keyword(s):

Solanum Tuberosum ◽

Reference Genome ◽

De Novo ◽

Draft Genome ◽

Single Copy ◽

Rapid Expansion ◽

Potato Genome ◽

Homozygous Diploid ◽

Gene Orthologs ◽

Reference Genomes

With the rapid expansion of the application of genomics and sequencing in plant breeding, there is a constant drive for better reference genomes. In potato (Solanum tuberosum), the third largest food crop in the world, the related species S. phureja, designated “DM”, has been used as the most popular reference genome for the last 10 years. Here, we introduce the de novo sequenced genome of Solyntus as the next standard reference in potato genome studies. A true Solanum tuberosum made up of 116 contigs that is also highly homozygous, diploid, vigorous and self-compatible, Solyntus provides a more direct and contiguous reference than ever before available. It was constructed by sequencing with state-of-the-art long and short read technology and assembled with Canu. The 116 contigs were assembled into scaffolds to form each pseudochromosome, with three to 17 contigs per chromosome. This assembly contains 93.7% of the single-copy gene orthologs from the Solanaceae set and has an N50 of 63.7 Mbp. The genome and related files can be found at https://www.plantbreeding.wur.nl/Solyntus/. With the release of this research line and its draft genome we anticipate many exciting developments in (diploid) potato research.
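
Since the abstract reports contiguity as an N50 value, a short sketch of how that statistic is commonly computed may help. It uses the standard definition (the length of the sequence at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size). The function name `n50` and the example lengths are illustrative and are not taken from the Solyntus assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum covers half the assembly.
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-negative lengths


example_lengths = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths, total 300
example_n50 = n50(example_lengths)
```

For the made-up lengths above, the two longest sequences (80 and 70) together reach half of the total of 300, so the N50 is 70.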


RFPlasmid: Predicting plasmid sequences from short read assembly data using machine learning

10.1101/2020.07.31.230631 ◽

2020 ◽

Cited By ~ 1

Author(s):

Linda van der Graaf van Bloois ◽

Jaap A. Wagenaar ◽

Aldert L. Zomer

Keyword(s):

Bacterial Species ◽

Draft Genome ◽

Single Copy ◽

Chromosomal Marker ◽

Web Interface ◽

Large Single Copy ◽

Genome Sequences ◽

Multiple Features ◽

E Coli ◽

Marker Proteins

Abstract Antimicrobial resistance (AMR) genes in bacteria are often carried on plasmids and these plasmids can transfer AMR genes between bacteria. For molecular epidemiology purposes and risk assessment, it is important to know if the genes are located on highly transferable plasmids or in the more stable chromosomes. However, draft whole genome sequences are fragmented, making it difficult to discriminate plasmid and chromosomal contigs. Current methods that predict plasmid sequences from draft genome sequences rely on single features, like k-mer composition, circularity of the DNA molecule, copy number or sequence identity to plasmid replication genes, all of which have their drawbacks, especially when faced with large single copy plasmids, which often carry resistance genes. With our newly developed prediction tool RFPlasmid, we use a combination of multiple features, including k-mer composition and databases with plasmid and chromosomal marker proteins, to predict if the likely source of a contig is plasmid or chromosomal. The tool RFPlasmid supports models for 17 different bacterial species, including Campylobacter, E. coli, and Salmonella, and has a species-agnostic model for metagenomic assemblies or unsupported organisms. RFPlasmid is available both as a standalone tool and via a web interface.
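
One of the features mentioned above, k-mer composition, can be sketched as a fixed-length frequency vector per contig. This is an illustration only and not RFPlasmid's code: the function name `kmer_profile`, the default `k = 2`, and the choice to skip k-mers containing ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product
from typing import List

ALPHABET = "ACGT"


def kmer_profile(seq: str, k: int = 2) -> List[float]:
    """Return relative k-mer frequencies in a fixed, alphabetical k-mer order."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(ALPHABET)  # skip k-mers with N or other symbols
    )
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # An empty or too-short sequence yields an all-zero vector.
    return [counts[km] / total if total else 0.0 for km in all_kmers]


profile = kmer_profile("ACGTN", k=2)  # counts AC, CG, GT; skips TN
```

A vector like this has 4^k entries regardless of contig length, which is what allows it to be combined with other per-contig features as input to a classifier.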


A de novo assembly of the sweet cherry (Prunus avium cv. Tieton) genome using linked-read sequencing technology

PeerJ ◽

10.7717/peerj.9114 ◽

2020 ◽

Vol 8 ◽

pp. e9114 ◽

Cited By ~ 1

Author(s):

Jiawei Wang ◽

Weizhen Liu ◽

Dongzi Zhu ◽

Xiang Zhou ◽

Po Hong ◽

...

Keyword(s):

Sweet Cherry ◽

Prunus Avium ◽

Reference Genome ◽

De Novo ◽

Draft Genome ◽

Single Copy ◽

Sequencing Data ◽

Sequencing Technology ◽

High Quality ◽

Eukaryotic Genes

The sweet cherry (Prunus avium) is one of the most economically important fruit species in the world. However, there is a limited amount of genetic information available for this species, which hinders breeding efforts at a molecular level. We were able to describe a high-quality reference genome assembly and annotation of the diploid sweet cherry (2n = 2x = 16) cv. Tieton using linked-read sequencing technology. We generated over 750 million clean reads, representing 112.63 GB of raw sequencing data. The Supernova assembler produced a more highly-ordered and continuous genome sequence than the current P. avium draft genome, with a contig N50 of 63.65 KB and a scaffold N50 of 2.48 MB. The final scaffold assembly was 280.33 MB in length, representing 82.12% of the estimated Tieton genome. Eight chromosome-scale pseudomolecules were constructed, completing a 214 MB sequence of the final scaffold assembly. De novo, homology-based, and RNA-seq methods were used together to predict 30,975 protein-coding loci. 98.39% of core eukaryotic genes and 97.43% of single copy orthologues were identified in the embryo plant, indicating the completeness of the assembly. Linked-read sequencing technology was effective in constructing a high-quality reference genome of the sweet cherry, which will benefit the molecular breeding and cultivar identification in this species.


Canfam_GSD: De novo chromosome-length genome assembly of the German Shepherd Dog (Canis lupus familiaris) using a combination of long reads, optical mapping, and Hi-C

GigaScience ◽

10.1093/gigascience/giaa027 ◽

2020 ◽

Vol 9 (4) ◽

Cited By ~ 5

Author(s):

Matt A Field ◽

Benjamin D Rosen ◽

Olga Dudchenko ◽

Eva K F Chan ◽

Andre E Minoche ◽

...

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Gene Annotation ◽

Draft Genome ◽

Single Copy ◽

Canis Lupus Familiaris ◽

German Shepherd ◽

German Shepherd Dog ◽

First Choice

Abstract Background: The German Shepherd Dog (GSD) is one of the most common breeds on earth and has been bred for its utility and intelligence. It is often first choice for police and military work, as well as protection, disability assistance, and search-and-rescue. Yet, GSDs are well known to be susceptible to a range of genetic diseases that can interfere with their training. Such diseases are of particular concern when they occur later in life, and fully trained animals are not able to continue their duties. Findings: Here, we provide the draft genome sequence of a healthy German Shepherd female as a reference for future disease and evolutionary studies. We generated this improved canid reference genome (CanFam_GSD) utilizing a combination of Pacific Biosciences, Oxford Nanopore, 10X Genomics, Bionano, and Hi-C technologies. The GSD assembly is ∼80 times as contiguous as the current canid reference genome, CanFam v3.1 (20.9 vs 0.267 Mb contig N50), and contains far fewer gaps (306 vs 23,876) and fewer scaffolds (429 vs 3,310). Two chromosomes (4 and 35) are assembled into single scaffolds with no gaps. BUSCO analyses of the genome assembly results show that 93.0% of the conserved single-copy genes are complete in the GSD assembly compared with 92.2% for CanFam v3.1. Homology-based gene annotation increases this value to ∼99%. Detailed examination of the evolutionarily important pancreatic amylase region reveals that there are most likely 7 copies of the gene, indicative of a duplication of 4 ancestral copies and the disruption of 1 copy. Conclusions: GSD genome assembly and annotation were produced with major improvement in completeness, continuity, and quality over the existing canid reference. This resource will enable further research related to canine diseases, the evolutionary relationships of canids, and other aspects of canid biology.


Scaffolding Contigs Using Multiple Reference Genomes

Computational Biology and Chemistry ◽

10.5772/intechopen.93456 ◽

2020 ◽

Author(s):

Yi-Kung Shieh ◽

Shu-Cheng Liu ◽

Chin Lung Lu

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

State Of The Art ◽

Draft Genome ◽

Evolutionary Relationship ◽

The State ◽

Target Genome ◽

Multiple Reference ◽

Reference Genomes

Scaffolding is an important step of genome assembly; its function is to order and orient the contigs of a draft genome assembly into larger scaffolds. Several single reference-based scaffolders have been proposed. However, a single reference genome may not be sufficient for a scaffolder to correctly scaffold a target draft genome, especially when the target genome and the reference genome have a distant evolutionary relationship or differ by rearrangements. This motivates researchers to develop so-called multiple reference-based scaffolders, which can utilize multiple reference genomes that may provide different but complementary types of scaffolding information to scaffold the target draft genome. In this chapter, we will review some of the state-of-the-art multiple reference-based scaffolders, such as Ragout, MeDuSa and Multi-CAR, and give a complete introduction to Multi-CSAR, an improved extension of Multi-CAR.
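
To make the idea of combining scaffolding information from several references concrete, here is a minimal sketch that counts how many references support each adjacency between oriented contigs. It does not reproduce Ragout, MeDuSa, Multi-CAR or Multi-CSAR; the simple voting scheme and the names `flip`, `canonical` and `adjacency_support` are assumptions. Each reference is assumed to have already yielded an ordered list of contigs with a strand.

```python
from collections import Counter
from typing import Dict, List, Tuple

Oriented = Tuple[str, str]  # (contig id, "+" or "-")
Adjacency = Tuple[Oriented, Oriented]


def flip(contig: Oriented) -> Oriented:
    """Return the same contig on the opposite strand."""
    return (contig[0], "-" if contig[1] == "+" else "+")


def canonical(a: Oriented, b: Oriented) -> Adjacency:
    """Give one representation to an adjacency and its reverse-strand reading."""
    forward = (a, b)
    backward = (flip(b), flip(a))
    return min(forward, backward)


def adjacency_support(orders: Dict[str, List[Oriented]]) -> Counter:
    """Count, over all references, how often each contig adjacency is observed."""
    support: Counter = Counter()
    for order in orders.values():
        for a, b in zip(order, order[1:]):
            support[canonical(a, b)] += 1
    return support


# Hypothetical contig orders derived from three reference genomes.
orders = {
    "ref1": [("c1", "+"), ("c2", "+"), ("c3", "-")],
    "ref2": [("c3", "+"), ("c2", "-"), ("c1", "-")],  # ref1 read on the other strand
    "ref3": [("c1", "+"), ("c3", "+"), ("c2", "+")],
}
support = adjacency_support(orders)
```

Here `ref1` and `ref2` describe the same layout read from opposite strands, so their adjacencies each collect two votes, while the conflicting adjacencies from `ref3` collect one. A real scaffolder would then need to select a consistent set of adjacencies from such evidence.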


RFPlasmid: predicting plasmid sequences from short-read assembly data using machine learning

Microbial Genomics ◽

10.1099/mgen.0.000683 ◽

2021 ◽

Vol 7 (11) ◽

Author(s):

Linda van der Graaf-van Bloois ◽

Jaap A. Wagenaar ◽

Aldert L. Zomer

Keyword(s):

Draft Genome ◽

Single Copy ◽

Chromosomal Marker ◽

Web Interface ◽

Large Single Copy ◽

Genome Sequences ◽

Multiple Features ◽

Content Type ◽

Link Type ◽

Marker Proteins

Antimicrobial-resistance (AMR) genes in bacteria are often carried on plasmids and these plasmids can transfer AMR genes between bacteria. For molecular epidemiology purposes and risk assessment, it is important to know whether the genes are located on highly transferable plasmids or in the more stable chromosomes. However, draft whole-genome sequences are fragmented, making it difficult to discriminate plasmid and chromosomal contigs. Current methods that predict plasmid sequences from draft genome sequences rely on single features, like k-mer composition, circularity of the DNA molecule, copy number or sequence identity to plasmid replication genes, all of which have their drawbacks, especially when faced with large single-copy plasmids, which often carry resistance genes. With our newly developed prediction tool RFPlasmid, we use a combination of multiple features, including k-mer composition and databases with plasmid and chromosomal marker proteins, to predict whether the likely source of a contig is plasmid or chromosomal. The tool RFPlasmid supports models for 17 different bacterial taxa, including Campylobacter, Escherichia coli and Salmonella, and has a taxon-agnostic model for metagenomic assemblies or unsupported organisms. RFPlasmid is available both as a standalone tool and via a web interface.


Genome Assembly of the Canadian Two-row Malting Barley Cultivar AAC Synergy

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab031 ◽

2021 ◽

Author(s):

Wayne Xu ◽

James R Tucker ◽

Wubishet A Bekele ◽

Frank M You ◽

Yong-Bi Fu ◽

...

Keyword(s):

Reference Genome ◽

Single Copy ◽

Barley Cultivar ◽

Malting Barley ◽

Orthologous Genes ◽

Hordeum Vulgare L ◽

Chromosome Conformation ◽

Mate Pair ◽

Genome Assemblies ◽

First Time

Abstract Barley (Hordeum vulgare L.) is one of the most important global crops. The six-row barley cultivar Morex reference genome has been used by the barley research community worldwide. However, this reference genome can have limitations when used for genomic and genetic diversity analysis studies, gene discovery, and marker development when working in two-row germplasm that is more common to Canadian barley. Here we assembled, for the first time, the genome sequence of a Canadian two-row malting barley, cultivar AAC Synergy. We applied deep Illumina paired-end reads, long mate-pair reads, PacBio sequences, 10X chromium linked read libraries, and chromosome conformation capture sequencing (Hi-C) to generate a contiguous assembly. The genome assembled from super-scaffolds had a size of 4.85 Gb, N50 of 2.32 Mb and an estimated 93.9% of complete genes from a plant database (BUSCO, benchmarking universal single-copy orthologous genes). After removal of small scaffolds (< 300 Kb), the assembly was arranged into pseudomolecules of 4.14 Gb in size with seven chromosomes plus unanchored scaffolds. The completeness and annotation of the assembly were assessed by comparing it with the updated version of six-row Morex and recently released two-row Golden Promise genome assemblies.


Reference flow: reducing reference bias using multiple population genomes

Genome Biology ◽

10.1186/s13059-020-02229-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.
