VSEARCH: a versatile open source tool for metagenomics

Background. VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing metagenomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar 2010), for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.

Methods. When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.

Results. VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e. format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0.

Discussion. VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.
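
The two-stage search strategy described in the Methods (a word-based filter followed by optimal global alignment) can be illustrated with a minimal Python sketch. This is not VSEARCH's implementation: the function names, the word length `k`, the scoring values, and the linear gap penalty are all assumptions chosen for brevity, and the sketch uses neither vectorisation nor threads.

```python
def kmers(seq, k=8):
    """Return the set of distinct words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_targets(query, targets, k=8, max_candidates=2):
    """Rank targets by the number of distinct words shared with the query.

    targets is a dict mapping a (hypothetical) target name to its sequence.
    Targets sharing no word with the query are dropped.
    """
    qwords = kmers(query, k)
    scored = [(len(qwords & kmers(seq, k)), name) for name, seq in targets.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for shared, name in scored[:max_candidates] if shared > 0]


def global_align_score(a, b, match=2, mismatch=-4, gap=-2):
    """Optimal global alignment score by full dynamic programming.

    Needleman-Wunsch with a linear gap penalty (an illustrative
    simplification); only two rows of the matrix are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

In a search, `rank_targets` would select a few candidates cheaply, and `global_align_score` would then be run only on those candidates, which is where the cost of full dynamic programming is paid.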

VSEARCH: a versatile open source tool for metagenomics

10.7287/peerj.preprints.2409 ◽

2016 ◽

Cited By ~ 9

Author(s):

Torbjørn Rognes ◽

Tomáš Flouri ◽

Ben Nichols ◽

Christopher Quince ◽

Frédéric Mahé

Keyword(s):

Open Source ◽

De Novo ◽

Sequence Data ◽

Pairwise Alignment ◽

Low Complexity ◽

Nucleotide Sequences ◽

Global Alignment ◽

Nucleotide Sequence Data ◽

Fastq File ◽

Target Sequences

VSEARCH: a versatile open source tool for metagenomics

PeerJ ◽

10.7717/peerj.2584 ◽

2016 ◽

Vol 4 ◽

pp. e2584 ◽

Cited By ~ 2221

Author(s):

Torbjørn Rognes ◽

Tomáš Flouri ◽

Ben Nichols ◽

Christopher Quince ◽

Frédéric Mahé

Keyword(s):

Open Source ◽

Population Genomics ◽

De Novo ◽

Sequence Data ◽

Pairwise Alignment ◽

Nucleotide Sequences ◽

Global Alignment ◽

Nucleotide Sequence Data ◽

Fastq File ◽

Target Sequences

Background. VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010), for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.

Methods. When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.

Results. VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0.

Discussion. VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.
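
Full-length dereplication, one of the commands listed in the Results, amounts to merging identical sequences and recording how often each occurs. The Python sketch below illustrates the idea only; it is not VSEARCH's code, and the function name, the case-insensitive comparison, and the output ordering (decreasing abundance, first-seen order among ties) are assumptions.

```python
def dereplicate_full_length(records):
    """Merge identical sequences and count their abundance.

    records is an iterable of (name, sequence) pairs. Sequences are
    compared case-insensitively (an assumption of this sketch). The
    name of the first occurrence represents each group.
    Returns a list of (name, sequence, abundance) tuples.
    """
    groups = {}  # sequence -> [representative name, abundance]
    for name, seq in records:
        key = seq.upper()
        if key in groups:
            groups[key][1] += 1
        else:
            groups[key] = [name, 1]
    unique = [(name, seq, count) for seq, (name, count) in groups.items()]
    # Stable sort: ties keep the order in which sequences were first seen.
    unique.sort(key=lambda rec: -rec[2])
    return unique
```

Prefix dereplication would additionally merge a sequence into a longer one that starts with it; that variant needs a tie-breaking rule when several longer sequences qualify, so it is left out of this sketch.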

Phylogenetic relationships among Hepatozoon species from snakes, frogs and mosquitoes of Ontario, Canada, determined by ITS-1 nucleotide sequences and life-cycle, morphological and developmental characteristics

Note: Nucleotide sequence data reported in this paper are in the EMBL, GenBank™ and DDBJ databases under accession numbers AF110241–AF110249.

International Journal for Parasitology ◽

10.1016/s0020-7519(98)00198-2 ◽

1999 ◽

Vol 29 (2) ◽

pp. 293-304 ◽

Cited By ~ 22

Author(s):

Todd G Smith ◽

Betty Kim ◽

Sherwin S Desser

Keyword(s):

Life Cycle ◽

Nucleotide Sequence ◽

Phylogenetic Relationships ◽

Sequence Data ◽

Nucleotide Sequences ◽

Nucleotide Sequence Data ◽

Hepatozoon Species

MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads

10.1101/089250 ◽

2016 ◽

Cited By ~ 2

Author(s):

Chuan-Le Xiao ◽

Ying Chen ◽

Shang-qian Xie ◽

Kai-Ning Chen ◽

Yan Wang ◽

...

Keyword(s):

Error Correction ◽

Single Molecule ◽

De Novo ◽

Computational Cost ◽

Pairwise Alignment ◽

Global Alignment ◽

Chinese Han ◽

Celera Assembler ◽

Reference Quality ◽

Molecular Sequencing

Abstract. The high computational cost of current assembly methods for long, noisy single molecular sequencing (SMS) reads has prevented them from assembling large genomes. We introduce an ultra-fast alignment method based on a novel global alignment score. For large human SMS data, our method is 7X faster than MHAP for pairwise alignment and 15X faster than BLASR for reference mapping. We develop a Mapping, Error Correction and de novo Assembly Tool (MECAT) by integrating our new alignment and error correction methods with the Celera Assembler. MECAT is capable of producing high-quality de novo assemblies of large genomes from SMS reads at low computational cost. MECAT produces reference-quality assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster, and reconstructs the human CHM1 genome with 15% longer NG50 in only 7600 CPU core hours using 54X SMS reads and a Chinese Han genome in 19200 CPU core hours using 102X SMS reads.

MGDb: An analyzed database and a genomic resource of mango (Mangifera indica L.) cultivars for mango research

10.1101/301358 ◽

2018 ◽

Author(s):

Tayyaba Qamar-ul-Islam ◽

M. Ahmed Khan ◽

Rabia Faizan ◽

Uzma Mahmood

Keyword(s):

De Novo ◽

Sequence Data ◽

Mangifera Indica ◽

Flat File ◽

Web Based ◽

Tropical Fruit ◽

Nucleotide Sequence Data ◽

Homologous Genes ◽

De Novo Sequence Assembly ◽

Genomic Resource

Abstract. Mango is a well-known fruit and the fifth most important subtropical/tropical fruit crop worldwide, with production centered in India and South-East Asia. Recently, there has been worldwide interest in mango genomics to produce tools for Marker Assisted Selection and trait association. No web-based analyzed genomic resource has been available specifically for mango. A complete mango genomic resource was therefore required to improve research on, and management of, mango germplasm. In this project, we performed a comparative transcriptome analysis of four mango cultivars: cv. Langra, cv. Zill, cv. Shelly and cv. Kent, from Pakistan, China, Israel and Mexico, respectively. The data were obtained through de novo sequence assembly, which generated 30,953–85,036 unigenes from the RNA-Seq datasets of the mango cultivars. The project aims to provide the scientific community and the general public with a mango genomic resource and to allow users to examine their data against our analyzed mango genome databases of the four cultivars (cv. Langra, cv. Zill, cv. Shelly and cv. Kent). The mango web genomic resource, MGDb, is based on a 3-tier architecture and was developed using Python, a flat-file database, and JavaScript. It contains information on the predicted genes of the whole genome, the unigenes annotated by homologous genes in other species, and GO (Gene Ontology) terms, which provide a glimpse of the traits in which they are involved. This web genomic resource can be of great use in assessing research, developing medicines and understanding genetics, and it provides a useful bioinformatics solution for the analysis of nucleotide sequence data. We report here the world’s first web-based genomic resource specifically for mango, for the genetic improvement and management of the mango genome.

Molecular Analysis of CAP59 Gene Sequences from Five Serotypes of Cryptococcus neoformans

Journal of Clinical Microbiology ◽

10.1128/jcm.38.3.992-995.2000 ◽

2000 ◽

Vol 38 (3) ◽

pp. 992-995 ◽

Cited By ~ 15

Author(s):

Yuka Nakamura ◽

Rui Kano ◽

Shinichi Watanabe ◽

Atsuhiko Hasegawa

Keyword(s):

Cryptococcus Neoformans ◽

Molecular Analysis ◽

Phylogenetic Relationships ◽

Mixed Type ◽

Sequence Data ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Nucleotide Sequence Data ◽

Serotype A ◽

Serotype B

The nucleotide sequences of CAP59 genes from five serotypes of Cryptococcus neoformans were analyzed for their phylogenetic relationships. Approximately 600-bp genomic DNA fragments of the CAP59 gene were amplified from each isolate by PCR and sequenced. The CAP59 nucleotide sequences of C. neoformans showed more than 90% similarity among the five serotypes. By phylogenetic analysis, the sequences were divided into three clusters: serotypes A and AD, serotypes B and C, and serotype D. In addition, the results for the deduced amino acid sequences were similar to the nucleotide sequence data. These data revealed that serotype AD was genetically closer to serotype A than to serotype D, although it had been considered a mixed type of serotypes A and D by serological analysis. Furthermore, the nucleotide sequences of the serotype B and C isolates of C. neoformans were very similar to each other. These results indicated that serotype B and C isolates belonging to C. neoformans var. gattii were genetically homogeneous and closely related. The molecular analysis of the CAP59 gene will provide useful information for the differentiation of serotypes of C. neoformans and for an understanding of their phylogenetic relationships.

A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab052 ◽

2021 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C Waldbieser ◽

Ramey C Youngblood ◽

Paul A Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Structural Variations ◽

High Coverage ◽

Haploid Chromosome Number ◽

Long Reads

Abstract. There is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the de novo assembly of the Arlee line genome from high-coverage PacBio long-read sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). It is composed of 938 scaffolds with an N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations identified in the Arlee genome included major inversions on chromosomes Omy05 and Omy20 and 15 additional smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.
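
The N50 statistic quoted above is, under its usual definition, the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal Python sketch of that calculation follows; the function name and the toy scaffold lengths in the usage note are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the
    length at which the running total first reaches half of the sum.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")
```

For the toy lengths `[2, 2, 2, 3, 3, 4, 8, 8]` (total 32), the two longest scaffolds already cover half of the total, so the N50 is 8.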
