Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data

J. A. Frank; Y. Pan; A. Tooming-Klunderud; V. G. H. Eijsink; A. C. McHardy; A. J. Nederbragt; P. B. Pope

doi:10.1038/srep25373

10.1101/026922 ◽

2015 ◽

Cited By ~ 2

Author(s):

Jeremy A. Frank ◽

Yao Pan ◽

Ave Tooming-Klunderud ◽

Vincent G.H. Eijsink ◽

Alice C. McHardy ◽

...

Keyword(s):

Sequence Data ◽

Consensus Sequence ◽

Dna Assembly ◽

Illumina Hiseq ◽

Average Contig Length ◽

Long Read ◽

And Performance ◽

And Function ◽

Taxonomic Binning ◽

Large Contigs

DNA assembly is a core methodological step in metagenomic pipelines used to study the structure and function of microbial communities. Here we investigate the utility of long, high-accuracy Pacific Biosciences circular consensus sequencing (CCS) reads for metagenomics projects. We compared the application and performance of both PacBio CCS and Illumina HiSeq data with assembly and taxonomic binning algorithms using metagenomic samples representing a complex microbial community. Eight SMRT cells produced approximately 94 Mb of CCS reads from a biogas reactor microbiome sample, which averaged 1319 nt in length and 99.7% accuracy. Assembly of the CCS data generated a number of large contigs (greater than 1 kb) comparable to that assembled from a ~190x larger HiSeq dataset (~18 Gb) produced from the same sample (i.e., approximately 62% of total contigs). Hybrid assemblies using PacBio CCS and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length and number of large contigs. The incorporation of CCS data produced significant enhancements in taxonomic binning and genome reconstruction of two dominant phylotypes, which assembled and binned poorly using HiSeq data alone. Collectively, these results illustrate the value of PacBio CCS reads in certain metagenomics applications.
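The figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constants are the rounded values from the abstract, and the function and variable names are hypothetical, not taken from the paper.

```python
# Rounded figures as quoted in the abstract; names are illustrative.
CCS_TOTAL_BASES = 94e6      # ~94 Mb of CCS reads from eight SMRT cells
CCS_MEAN_READ_LEN = 1319    # mean CCS read length, nt
CCS_ACCURACY = 0.997        # mean CCS read accuracy
HISEQ_TOTAL_BASES = 18e9    # ~18 Gb HiSeq dataset


def fold_difference(larger, smaller):
    """How many times larger one dataset is than the other."""
    return larger / smaller


def approx_read_count(total_bases, mean_read_len):
    """Rough number of reads implied by total yield and mean read length."""
    return total_bases / mean_read_len


def expected_errors_per_read(mean_read_len, accuracy):
    """Expected number of erroneous bases in a read of average length."""
    return mean_read_len * (1.0 - accuracy)


print(f"HiSeq / CCS yield: {fold_difference(HISEQ_TOTAL_BASES, CCS_TOTAL_BASES):.0f}x")
print(f"Approx. CCS reads: {approx_read_count(CCS_TOTAL_BASES, CCS_MEAN_READ_LEN):.0f}")
print(f"Errors per read:   {expected_errors_per_read(CCS_MEAN_READ_LEN, CCS_ACCURACY):.1f}")
```

Under these rounded inputs the yield ratio comes out slightly above 190, consistent with the "~190x" stated in the abstract.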

Genome sequences of human cytomegalovirus strain TB40/E variants propagated in fibroblasts and epithelial cells

Virology Journal ◽

10.1186/s12985-021-01583-3 ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Ahmed Al Qaffas ◽

Salvatore Camiolo ◽

Mai Vo ◽

Alexis Aguiar ◽

Amine Ourahmane ◽

...

Keyword(s):

Epithelial Cells ◽

Human Cytomegalovirus ◽

Viral Entry ◽

Sequence Data ◽

Laboratory Strain ◽

Serial Passage ◽

Wild Type Virus ◽

Protein Coding ◽

Genetic Changes ◽

Long Read

The advent of whole genome sequencing has revealed that common laboratory strains of human cytomegalovirus (HCMV) have major genetic deficiencies resulting from serial passage in fibroblasts. In particular, tropism for epithelial and endothelial cells is lost due to mutations disrupting genes UL128, UL130, or UL131A, which encode subunits of a virion-associated pentameric complex (PC) important for viral entry into these cells but not for entry into fibroblasts. The endothelial cell-adapted strain TB40/E has a relatively intact genome and has emerged as a laboratory strain that closely resembles wild-type virus. However, several heterogeneous TB40/E stocks and cloned variants exist that display a range of sequence and tropism properties. Here, we report the use of PacBio sequencing to elucidate the genetic changes that occurred, both at the consensus level and within subpopulations, upon passaging a TB40/E stock on ARPE-19 epithelial cells. The long-read data also facilitated examination of the linkage between mutations. Consistent with inefficient ARPE-19 cell entry, at least 83% of viral genomes present before adaptation contained changes impacting PC subunits. In contrast, and consistent with the importance of the PC for entry into endothelial and epithelial cells, genomes after adaptation lacked these or additional mutations impacting PC subunits. The sequence data also revealed six single noncoding substitutions in the inverted repeat regions, single nonsynonymous substitutions in genes UL26, UL69, US28, and UL122, and a frameshift truncating gene UL141. Among the changes affecting protein-coding regions, only the one in UL122 was strongly selected. This change, resulting in a D390H substitution in the encoded protein IE2, has been previously implicated in rendering another viral protein, UL84, essential for viral replication in fibroblasts. This finding suggests that IE2, and perhaps its interactions with UL84, have important functions unique to HCMV replication in epithelial cells.

A species-specific satellite DNA from the entomopathogenic nematode Heterorhabditis indicus

Genome ◽

10.1139/g98-005 ◽

1998 ◽

Vol 41 (2) ◽

pp. 148-153 ◽

Cited By ~ 8

Author(s):

Monique Abadon ◽

Eric Grenier ◽

Christian Laumond ◽

Pierre Abad

Keyword(s):

Dna Sequence ◽

Satellite Dna ◽

Tandem Repeats ◽

Sequence Data ◽

Entomopathogenic Nematode ◽

Consensus Sequence ◽

Repeated Sequence ◽

Nucleotide Sequence Analysis ◽

Specific Sequence ◽

Species Specific

An AluI satellite DNA family has been cloned from the entomopathogenic nematode Heterorhabditis indicus. This repeated sequence appears to be an unusually abundant satellite DNA, since it constitutes about 45% of the H. indicus genome. The consensus sequence is 174 nucleotides long and has an A + T content of 56%, with the presence of direct and inverted repeat clusters. DNA sequence data reveal that monomers are quite homogeneous. Such homogeneity suggests that some mechanism is acting to maintain the homogeneity of this satellite DNA, despite its abundance, or that this repeated sequence could have appeared recently in the genome of H. indicus. Hybridization analysis of genomic DNAs from different Heterorhabditis species shows that this satellite DNA sequence is specific to the H. indicus genome. Considering the species specificity and the high copy number of this AluI satellite DNA sequence, it could provide a rapid and powerful tool for identifying H. indicus strains.

Key words: AluI repeated DNA, tandem repeats, species-specific sequence, nucleotide sequence analysis.
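As a minimal sketch of how a consensus sequence and its A + T content might be derived from aligned satellite monomers: the toy monomers below are invented for illustration (they are not the H. indicus sequences), and the simple majority-rule consensus is an assumption, not necessarily the procedure the authors used.

```python
from collections import Counter


def at_content(seq):
    """Fraction of A and T bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("A") + seq.count("T")) / len(seq)


def majority_consensus(monomers):
    """Column-wise majority-rule consensus of equal-length, pre-aligned monomers."""
    monomers = [m.upper() for m in monomers]
    if len({len(m) for m in monomers}) != 1:
        raise ValueError("monomers must be aligned to the same length")
    # zip(*monomers) yields one tuple of bases per alignment column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*monomers))


# Hypothetical toy monomers, far shorter than the 174-nt repeat unit.
toy_monomers = ["ATGCATTA", "ATGCATTA", "ATGGATTA"]
consensus = majority_consensus(toy_monomers)
print(consensus, f"A+T = {at_content(consensus):.0%}")
```

A real analysis would first need a multiple alignment of the cloned monomers; this sketch assumes that step has already been done.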

Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data

Briefings in Bioinformatics ◽

10.1093/bib/bbx147 ◽

2017 ◽

Vol 20 (3) ◽

pp. 866-876 ◽

Cited By ~ 30

Author(s):

Vasanthan Jayakumar ◽

Yasubumi Sakakibara

Keyword(s):

Genome Assembly ◽

Comprehensive Evaluation ◽

Sequence Data ◽

Third Generation ◽

Hybrid Genome ◽

Long Read

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.

Multiple Displacement Amplification as a Solution for Low Copy Number Plasmid Sequencing

Frontiers in Microbiology ◽

10.3389/fmicb.2021.617487 ◽

2021 ◽

Vol 12 ◽

Author(s):

Kuan Yao ◽

Narjol González-Escalona ◽

Maria Hoffmann

Keyword(s):

Antibiotic Resistance ◽

Single Molecule ◽

Plasmid Dna ◽

Copy Number ◽

Sequence Data ◽

Multiple Displacement Amplification ◽

Fast Method ◽

Alkaline Lysis ◽

Low Copy Number ◽

Long Read

Plasmids play a major role in bacterial adaptation to environmental stress and often contribute to antibiotic resistance and disease virulence. Although the complete sequence of each plasmid is essential for studying plasmid biology, most antibiotic resistance and virulence plasmids in Salmonella are present only in a low copy number, making extraction and sequencing difficult. Long read sequencing technologies require higher concentrations of DNA to provide optimal results. To resolve this problem, we assessed the sufficiency of multiple displacement amplification (MDA) for replicating Salmonella plasmid DNA to a satisfactory concentration for accurate sequencing and multiplexing. Nine Salmonella enterica isolates, representing nine different serovars carrying plasmids for which sequence data are already available at NCBI, were cultured and their plasmids isolated using an alkaline lysis extraction protocol. We then used the Phi29 polymerase to perform MDA, thereby obtaining enough plasmid DNA for long read sequencing. These amplified plasmids were multiplexed and sequenced on one single molecule, real-time (SMRT) cell with the Pacific Biosciences (PacBio) Sequel sequencer. We were able to close all Salmonella plasmids (sizes ranged from 38 to 166 kb) with sequencing coverage from 24 to 2,582X. This protocol, consisting of plasmid isolation, MDA, and multiplex sequencing, is an effective and fast method for closing high-molecular-weight and low-copy-number plasmids. This high-throughput protocol reduces the time and cost of plasmid closure.
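The coverage figures above follow the usual definition of mean sequencing depth: total sequenced bases assigned to a plasmid divided by the plasmid length. The sketch below illustrates that relationship; the function names and the example read lengths are hypothetical, and only the 38 kb size and 24X depth are taken from the abstract.

```python
def mean_coverage(read_lengths, plasmid_length):
    """Mean depth: total bases in reads assigned to a plasmid / plasmid length."""
    if plasmid_length <= 0:
        raise ValueError("plasmid_length must be positive")
    return sum(read_lengths) / plasmid_length


def bases_needed(plasmid_length, target_coverage):
    """Total sequenced bases required to reach a target mean depth."""
    return plasmid_length * target_coverage


# Hypothetical example: ten 10 kb reads on a 50 kb plasmid.
print(mean_coverage([10_000] * 10, 50_000))

# Smallest plasmid size and lowest depth reported in the abstract.
print(bases_needed(38_000, 24))
```

Mean depth says nothing about how evenly reads are distributed, which matters for amplified material, so this is only a first-order check.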

Genomic dynamics of species and mobile genetic elements in a prolonged blaIMP-4-associated carbapenemase outbreak in an Australian hospital

Journal of Antimicrobial Chemotherapy ◽

10.1093/jac/dkz526 ◽

2020 ◽

Vol 75 (4) ◽

pp. 873-882 ◽

Cited By ~ 3

Author(s):

A Kizny Gordon ◽

H T T Phan ◽

S I Lipworth ◽

E Cheong ◽

T Gottlieb ◽

...

Keyword(s):

Resistance Genes ◽

Genetic Relatedness ◽

Sequence Data ◽

Environmental Isolates ◽

Class 1 Integron ◽

Oxford Nanopore ◽

Class 1 ◽

Long Read ◽

Cleaning Methods ◽

Burns Unit

Background: Hospital outbreaks of carbapenemase-producing organisms, such as blaIMP-4-containing organisms, are an increasing threat to patient safety.

Objectives: To investigate the genomic dynamics of a 10 year (2006–15) outbreak of blaIMP-4-containing organisms in a burns unit in a hospital in Sydney, Australia.

Methods: All carbapenem-non-susceptible or MDR clinical isolates (2006–15) and a random selection of equivalent or ESBL-producing environmental isolates (2012–15) were sequenced [short-read (Illumina), long-read (Oxford Nanopore Technology)]. Sequence data were used to assess genetic relatedness of isolates (Mash; mapping and recombination-adjusted phylogenies), perform in silico typing (MLST, resistance genes and plasmid replicons) and reconstruct a subset of blaIMP plasmids for comparative plasmid genomics.

Results: A total of 46/58 clinical and 67/96 environmental isolates contained blaIMP-4. All blaIMP-4-positive organisms contained five or more other resistance genes. Enterobacter cloacae was the predominant organism, with 12 other species mainly found in either the environment or patients, some persisting despite several cleaning methods. On phylogenetic analysis there were three genetic clusters of E. cloacae containing both clinical and environmental isolates, and an additional four clusters restricted to either reservoir. blaIMP-4 was mostly found as part of a cassette array (blaIMP-4-qacG2-aacA4-catB3) in a class 1 integron within a previously described IncM2 plasmid (pEl1573), with almost complete conservation of this cassette across the species over the 10 years. Several other plasmids were also implicated, including an IncF plasmid backbone not previously widely described in association with blaIMP-4.

Conclusions: Genetic backgrounds disseminating blaIMP-4 can persist, diversify and evolve amongst both human and environmental reservoirs during a prolonged outbreak despite intensive prevention efforts.

Improved Genome Assembly and Annotation of the Soybean Aphid (Aphis glycines Matsumura)

G3 Genes|Genome|Genetics ◽

10.1534/g3.119.400954 ◽

2020 ◽

Vol 10 (3) ◽

pp. 899-906 ◽

Cited By ~ 8

Author(s):

Thomas C. Mathers

Keyword(s):

Genome Assembly ◽

Sequence Data ◽

Parasitoid Wasp ◽

Single Copy ◽

Aphid Species ◽

Soybean Aphid ◽

Aphis Glycines ◽

Data Sets ◽

Conserved Genes ◽

Long Read

Aphids are an economically important insect group due to their role as plant disease vectors. Despite this economic impact, genomic resources have only been generated for a small number of aphid species. The soybean aphid (Aphis glycines Matsumura) was the third aphid species to have its genome sequenced and the first to use long-read sequence data. However, version 1 of the soybean aphid genome assembly has low contiguity (contig N50 = 57 Kb, scaffold N50 = 174 Kb), poor representation of conserved genes and the presence of genomic scaffolds likely derived from parasitoid wasp contamination. Here, I use recently developed methods to reassemble the soybean aphid genome. The version 2 genome assembly is highly contiguous, containing half of the genome in only 40 scaffolds (contig N50 = 2.00 Mb, scaffold N50 = 2.51 Mb) and contains 11% more conserved single-copy arthropod genes than version 1. To demonstrate the utility of this improved assembly, I identify a region of conserved synteny between aphids and Drosophila containing members of the Osiris gene family that was split over multiple scaffolds in the original assembly. The improved genome assembly and annotation of A. glycines demonstrates the benefit of applying new methods to old data sets and will provide a useful resource for future comparative genome analysis of aphids.
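The contiguity statistics quoted above (contig and scaffold N50, and "half of the genome in only 40 scaffolds", often called L50) can be computed from a list of sequence lengths. The following is a minimal sketch using the common definition of N50; the example lengths are made up and are not from the soybean aphid assembly.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def l50(lengths):
    """Smallest number of sequences whose summed length reaches half the assembly."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return count


# Hypothetical toy assembly (total length 32).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print("N50 =", n50(toy_lengths), "L50 =", l50(toy_lengths))
```

Both functions sort from longest to shortest and stop at the first sequence that brings the running total to half the assembly size.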

Highly accurate long-read HiFi sequencing data for five complex genomes

Scientific Data ◽

10.1038/s41597-020-00743-4 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Ting Hon ◽

Kristin Mars ◽

Greg Young ◽

Yu-Chih Tsai ◽

Joseph W. Karalius ◽

...

Keyword(s):

Sequence Data ◽

Genome Structure ◽

Data Sets ◽

Sequencing Data ◽

Complex Samples ◽

Bioinformatic Tools ◽

Long Reads ◽

Sequencing Method ◽

Sample Data ◽

Long Read

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.
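Read accuracies such as "greater than 99.5%" are often expressed on the Phred scale, Q = -10 log10(1 - accuracy). The abstract does not use Phred scores, so the conversion below is offered only as a convenience sketch under that standard convention; the function names are hypothetical.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q):
    """Convert a Phred quality score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


print(f"99.5% accuracy is about Q{accuracy_to_phred(0.995):.0f}")
```

On this scale, every additional 10 points corresponds to a tenfold reduction in the per-base error rate.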

Evi-1, a murine zinc finger proto-oncogene, encodes a sequence-specific DNA-binding protein.

Molecular and Cellular Biology ◽

10.1128/mcb.11.5.2665 ◽

1991 ◽

Vol 11 (5) ◽

pp. 2665-2674 ◽

Cited By ~ 69

Author(s):

A S Perkins ◽

R Fishel ◽

N A Jenkins ◽

N G Copeland

Keyword(s):

Dna Binding ◽

Binding Site ◽

Zinc Finger ◽

Sequence Data ◽

Binding Protein ◽

Consensus Sequence ◽

Dna Binding Protein ◽

Polymerase Chain Reaction Method ◽

Nucleotide Sequence Data ◽

Carboxy Terminal

Evi-1 was originally identified as a common site of viral integration in murine myeloid tumors. Evi-1 encodes a 120-kDa polypeptide containing 10 zinc finger motifs located in two domains 380 amino acids apart and an acidic domain located carboxy terminal to the second set of zinc fingers. These features suggest that Evi-1 is a site-specific DNA-binding protein involved in the regulation of RNA transcription. We have purified Evi-1 protein from E. coli and have employed a gel shift-polymerase chain reaction method using random oligonucleotides to identify a high-affinity binding site for Evi-1. The consensus sequence for this binding site is TGACAAGATAA. Evi-1 protein specifically protects this motif from DNase I digestion. By searching the nucleotide sequence databases, we have found this binding site both in sequences 5' to genes in putative or known regulatory regions and within intron sequences.
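The database search described above can be illustrated with an exact-match scan for the consensus TGACAAGATAA on both strands of a sequence. This is a simplified sketch, not the authors' search method: it ignores degenerate positions, and the example sequence and function names are hypothetical.

```python
MOTIF = "TGACAAGATAA"  # Evi-1 consensus binding site from the abstract
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_sites(seq, motif=MOTIF):
    """Return sorted (0-based start, strand) pairs for exact motif matches.

    '-' hits are reported at the forward-strand coordinate where the
    reverse complement of the motif begins. A palindromic motif would be
    reported on both strands; the Evi-1 consensus is not palindromic.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits)


# Hypothetical sequence with one site on each strand.
example = "CC" + MOTIF + "GG" + reverse_complement(MOTIF)
print(find_sites(example))
```

A practical scan of regulatory regions would more likely score candidates against a position weight matrix, since real binding sites rarely match a consensus exactly.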