Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data

2016 ◽  
Vol 6 (1) ◽  
Author(s):  
J. A. Frank ◽  
Y. Pan ◽  
A. Tooming-Klunderud ◽  
V. G. H. Eijsink ◽  
A. C. McHardy ◽  
...  

2015 ◽  
Author(s):  
Jeremy A. Frank ◽  
Yao Pan ◽  
Ave Tooming-Klunderud ◽  
Vincent G.H. Eijsink ◽  
Alice C. McHardy ◽  
...  

DNA assembly is a core methodological step in metagenomic pipelines used to study the structure and function of microbial communities. Here we investigate the utility of Pacific Biosciences long, high-accuracy circular consensus sequencing (CCS) reads for metagenomics projects. We compared the application and performance of both PacBio CCS and Illumina HiSeq data with assembly and taxonomic binning algorithms using metagenomic samples representing a complex microbial community. Eight SMRT cells produced approximately 94 Mb of CCS reads from a biogas reactor microbiome sample, which averaged 1319 nt in length and 99.7% accuracy. CCS data assembly generated a comparable number of large contigs (greater than 1 kb) to that assembled from a ~190x larger HiSeq dataset (~18 Gb) produced from the same sample (i.e., approximately 62% of total contigs). Hybrid assemblies using PacBio CCS and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length and number of large contigs. The incorporation of CCS data produced significant enhancements in taxonomic binning and genome reconstruction of two dominant phylotypes, which assembled and binned poorly using HiSeq data alone. Collectively, these results illustrate the value of PacBio CCS reads in certain metagenomics applications.



2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Ahmed Al Qaffas ◽  
Salvatore Camiolo ◽  
Mai Vo ◽  
Alexis Aguiar ◽  
Amine Ourahmane ◽  
...  

The advent of whole genome sequencing has revealed that common laboratory strains of human cytomegalovirus (HCMV) have major genetic deficiencies resulting from serial passage in fibroblasts. In particular, tropism for epithelial and endothelial cells is lost due to mutations disrupting genes UL128, UL130, or UL131A, which encode subunits of a virion-associated pentameric complex (PC) important for viral entry into these cells but not for entry into fibroblasts. The endothelial cell-adapted strain TB40/E has a relatively intact genome and has emerged as a laboratory strain that closely resembles wild-type virus. However, several heterogeneous TB40/E stocks and cloned variants exist that display a range of sequence and tropism properties. Here, we report the use of PacBio sequencing to elucidate the genetic changes that occurred, both at the consensus level and within subpopulations, upon passaging a TB40/E stock on ARPE-19 epithelial cells. The long-read data also facilitated examination of the linkage between mutations. Consistent with inefficient ARPE-19 cell entry, at least 83% of viral genomes present before adaptation contained changes impacting PC subunits. In contrast, and consistent with the importance of the PC for entry into endothelial and epithelial cells, genomes after adaptation lacked these or additional mutations impacting PC subunits. The sequence data also revealed six single noncoding substitutions in the inverted repeat regions, single nonsynonymous substitutions in genes UL26, UL69, US28, and UL122, and a frameshift truncating gene UL141. Among the changes affecting protein-coding regions, only the one in UL122 was strongly selected. This change, resulting in a D390H substitution in the encoded protein IE2, has been previously implicated in rendering another viral protein, UL84, essential for viral replication in fibroblasts. This finding suggests that IE2, and perhaps its interactions with UL84, have important functions unique to HCMV replication in epithelial cells.



Genome ◽  
1998 ◽  
Vol 41 (2) ◽  
pp. 148-153 ◽  
Author(s):  
Monique Abadon ◽  
Eric Grenier ◽  
Christian Laumond ◽  
Pierre Abad

An AluI satellite DNA family has been cloned from the entomopathogenic nematode Heterorhabditis indicus. This repeated sequence appears to be an unusually abundant satellite DNA, since it constitutes about 45% of the H. indicus genome. The consensus sequence is 174 nucleotides long and has an A + T content of 56%, with the presence of direct and inverted repeat clusters. DNA sequence data reveal that monomers are quite homogeneous. Such homogeneity suggests that some mechanism is acting to maintain the homogeneity of this satellite DNA, despite its abundance, or that this repeated sequence could have appeared recently in the genome of H. indicus. Hybridization analysis of genomic DNAs from different Heterorhabditis species shows that this satellite DNA sequence is specific to the H. indicus genome. Considering the species specificity and the high copy number of this AluI satellite DNA sequence, it could provide a rapid and powerful tool for identifying H. indicus strains. Key words: AluI repeated DNA, tandem repeats, species-specific sequence, nucleotide sequence analysis.





2020 ◽  
Author(s):  
Andrew J. Page ◽  
Nabil-Fareed Alikhan ◽  
Michael Strinden ◽  
Thanh Le Viet ◽  
Timofey Skvortsov

Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.



2021 ◽  
Vol 12 ◽  
Author(s):  
Kuan Yao ◽  
Narjol González-Escalona ◽  
Maria Hoffmann

Plasmids play a major role in bacterial adaptation to environmental stress and often contribute to antibiotic resistance and disease virulence. Although the complete sequence of each plasmid is essential for studying plasmid biology, most antibiotic resistance and virulence plasmids in Salmonella are present only in a low copy number, making extraction and sequencing difficult. Long read sequencing technologies require higher concentrations of DNA to provide optimal results. To resolve this problem, we assessed the sufficiency of multiple displacement amplification (MDA) for replicating Salmonella plasmid DNA to a satisfactory concentration for accurate sequencing and multiplexing. Nine Salmonella enterica isolates, representing nine different serovars carrying plasmids for which sequence data are already available at NCBI, were cultured and their plasmids isolated using an alkaline lysis extraction protocol. We then used the Phi29 polymerase to perform MDA, thereby obtaining enough plasmid DNA for long read sequencing. These amplified plasmids were multiplexed and sequenced on one single molecule, real-time (SMRT) cell with the Pacific Biosciences (PacBio) Sequel sequencer. We were able to close all Salmonella plasmids (sizes ranged from 38 to 166 Kb) with sequencing coverage from 24 to 2,582X. This protocol, consisting of plasmid isolation, MDA, and multiplex sequencing, is an effective and fast method for closing high-molecular-weight and low-copy-number plasmids. This high throughput protocol reduces the time and cost of plasmid closure.



2020 ◽  
Vol 75 (4) ◽  
pp. 873-882 ◽  
Author(s):  
A Kizny Gordon ◽  
H T T Phan ◽  
S I Lipworth ◽  
E Cheong ◽  
T Gottlieb ◽  
...  

Background: Hospital outbreaks of carbapenemase-producing organisms, such as blaIMP-4-containing organisms, are an increasing threat to patient safety. Objectives: To investigate the genomic dynamics of a 10 year (2006–15) outbreak of blaIMP-4-containing organisms in a burns unit in a hospital in Sydney, Australia. Methods: All carbapenem-non-susceptible or MDR clinical isolates (2006–15) and a random selection of equivalent or ESBL-producing environmental isolates (2012–15) were sequenced [short-read (Illumina), long-read (Oxford Nanopore Technology)]. Sequence data were used to assess genetic relatedness of isolates (Mash; mapping and recombination-adjusted phylogenies), perform in silico typing (MLST, resistance genes and plasmid replicons) and reconstruct a subset of blaIMP plasmids for comparative plasmid genomics. Results: A total of 46/58 clinical and 67/96 environmental isolates contained blaIMP-4. All blaIMP-4-positive organisms contained five or more other resistance genes. Enterobacter cloacae was the predominant organism, with 12 other species mainly found in either the environment or patients, some persisting despite several cleaning methods. On phylogenetic analysis there were three genetic clusters of E. cloacae containing both clinical and environmental isolates, and an additional four clusters restricted to either reservoir. blaIMP-4 was mostly found as part of a cassette array (blaIMP-4-qacG2-aacA4-catB3) in a class 1 integron within a previously described IncM2 plasmid (pEl1573), with almost complete conservation of this cassette across the species over the 10 years. Several other plasmids were also implicated, including an IncF plasmid backbone not previously widely described in association with blaIMP-4. Conclusions: Genetic backgrounds disseminating blaIMP-4 can persist, diversify and evolve amongst both human and environmental reservoirs during a prolonged outbreak despite intensive prevention efforts.



2020 ◽  
Vol 10 (3) ◽  
pp. 899-906 ◽  
Author(s):  
Thomas C. Mathers

Aphids are an economically important insect group due to their role as plant disease vectors. Despite this economic impact, genomic resources have only been generated for a small number of aphid species. The soybean aphid (Aphis glycines Matsumura) was the third aphid species to have its genome sequenced and the first to use long-read sequence data. However, version 1 of the soybean aphid genome assembly has low contiguity (contig N50 = 57 Kb, scaffold N50 = 174 Kb), poor representation of conserved genes and the presence of genomic scaffolds likely derived from parasitoid wasp contamination. Here, I use recently developed methods to reassemble the soybean aphid genome. The version 2 genome assembly is highly contiguous, containing half of the genome in only 40 scaffolds (contig N50 = 2.00 Mb, scaffold N50 = 2.51 Mb) and contains 11% more conserved single-copy arthropod genes than version 1. To demonstrate the utility of this improved assembly, I identify a region of conserved synteny between aphids and Drosophila containing members of the Osiris gene family that was split over multiple scaffolds in the original assembly. The improved genome assembly and annotation of A. glycines demonstrate the benefit of applying new methods to old data sets and will provide a useful resource for future comparative genome analysis of aphids.
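The contiguity figures quoted in this abstract (N50, and "half of the genome in only 40 scaffolds", commonly called L50) follow a standard definition that can be sketched in a few lines. The function name `n50_l50` and the toy lengths below are illustrative assumptions, not taken from the paper or from any assembler's own tooling.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    N50: the length of the shortest sequence in the smallest set of
         longest sequences that together cover at least half the total.
    L50: the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example (hypothetical lengths): total = 32, half = 16.
# The two longest sequences (8 + 8) already reach 16.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50_l50(toy_lengths))
```

Under this definition, a higher N50 together with a lower L50 indicates a more contiguous assembly, which is the direction of change reported between version 1 and version 2.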



2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Ting Hon ◽  
Kristin Mars ◽  
Greg Young ◽  
Yu-Chih Tsai ◽  
Joseph W. Karalius ◽  
...  

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets to both evaluate the benefits of these long accurate reads as well as for development of bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.



1991 ◽  
Vol 11 (5) ◽  
pp. 2665-2674 ◽  
Author(s):  
A S Perkins ◽  
R Fishel ◽  
N A Jenkins ◽  
N G Copeland

Evi-1 was originally identified as a common site of viral integration in murine myeloid tumors. Evi-1 encodes a 120-kDa polypeptide containing 10 zinc finger motifs located in two domains 380 amino acids apart and an acidic domain located carboxy terminal to the second set of zinc fingers. These features suggest that Evi-1 is a site-specific DNA-binding protein involved in the regulation of RNA transcription. We have purified Evi-1 protein from E. coli and have employed a gel shift-polymerase chain reaction method using random oligonucleotides to identify a high-affinity binding site for Evi-1. The consensus sequence for this binding site is TGACAAGATAA. Evi-1 protein specifically protects this motif from DNase I digestion. By searching the nucleotide sequence data bases, we have found this binding site both in sequences 5' to genes in putative or known regulatory regions and within intron sequences.
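Searching sequence records for the reported consensus site can be illustrated with a minimal exact-match scan of both strands. This is only a sketch: the abstract does not say how the database search was performed, the helper names are hypothetical, and mismatches or degenerate positions are not handled.

```python
MOTIF = "TGACAAGATAA"  # consensus binding site quoted in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(sequence, motif=MOTIF):
    """Return sorted (position, strand) pairs for every exact match.

    Positions are 0-based offsets on the forward strand; "-" marks a
    match to the reverse complement of the motif.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            # Step by one so overlapping matches are not missed.
            start = sequence.find(pattern, start + 1)
    return sorted(hits)


# Made-up test sequence containing one site on each strand.
example = "CCTGACAAGATAAGG" + "TTATCTTGTCA"
print(find_motif(example))
```

A position weight matrix built from the selected oligonucleotides would be the usual next step for tolerating mismatches, but that requires data not given in the abstract.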


