dnabarcoder: an open-source software package for analyzing and predicting DNA sequence similarity cut-offs for fungal sequence identification

Author(s):  
Duong Vu ◽  
Henrik Nilsson ◽  
Gerard Verkley

The accuracy and precision of fungal molecular identification and classification are challenging, particularly in environmental metabarcoding approaches as these often trade accuracy for efficiency given the large data volumes at hand. In most ecological studies, only a single similarity cut-off value is used for sequence identification. This is not sufficient since the most commonly used DNA markers are known to vary widely in terms of inter- and intra-specific variability. We address this problem by presenting a new tool, dnabarcoder, to analyze and predict different local similarity cut-offs for sequence identification for different clades of fungi. For each similarity cut-off in a clade, a confidence measure is computed to evaluate the resolving power of the genetic marker in that clade. Experimental results showed that when analyzing a recently released filamentous fungal ITS DNA barcode dataset of CBS strains from the Westerdijk Fungal Biodiversity Institute, the predicted local similarity cut-offs varied immensely between the clades of the dataset. In addition, most of them had a higher confidence measure than the global similarity cut-off predicted for the whole dataset. When classifying a large public fungal ITS dataset (the UNITE database) against the barcode dataset, the local similarity cut-offs assigned fewer sequences than the traditional cut-offs used in metabarcoding studies. However, the obtained accuracy and precision were significantly improved.
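The difference between a single global cut-off and clade-specific local cut-offs can be illustrated with a minimal Python sketch. It is not dnabarcoder's API: the function names, the clade labels, and every numeric cut-off below are hypothetical values chosen only for illustration, and the query's best-match similarity is assumed to have been computed elsewhere (for example by a BLAST-style search).

```python
from typing import Dict, Optional


def pick_cutoff(clade: str, local_cutoffs: Dict[str, float], global_cutoff: float) -> float:
    """Return the clade-specific cut-off if one exists, otherwise the global cut-off."""
    return local_cutoffs.get(clade, global_cutoff)


def identify(best_match_similarity: float,
             best_match_clade: str,
             local_cutoffs: Dict[str, float],
             global_cutoff: float) -> Optional[str]:
    """Assign the query to the clade of its best reference match,
    but only if the similarity reaches the cut-off that applies to that clade."""
    cutoff = pick_cutoff(best_match_clade, local_cutoffs, global_cutoff)
    return best_match_clade if best_match_similarity >= cutoff else None


# Hypothetical cut-offs, invented for this example (not taken from the paper).
LOCAL = {"CladeA": 0.994, "CladeB": 0.970}
GLOBAL = 0.985

# The same similarity score is rejected in a clade with a strict local cut-off
# and accepted in a clade with a looser one.
print(identify(0.98, "CladeA", LOCAL, GLOBAL))
print(identify(0.98, "CladeB", LOCAL, GLOBAL))
```

A stricter local cut-off leaves more queries unassigned, which is consistent with the abstract's observation that local cut-offs assigned fewer sequences than the traditional ones.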

2015 ◽  
Vol 15 (4) ◽  
pp. 286-295 ◽  
Author(s):  
Sebastin Raveendar ◽  
Jung-Ro Lee ◽  
Donghwan Shim ◽  
Gi-An Lee ◽  
Young-Ah Jeon ◽  
...  

The genus Vicia L., one of the earliest domesticated plant genera, is a member of the legume tribe Fabeae of the subfamily Papilionoideae (Fabaceae). The taxonomic history of this genus is extensive and controversial, which has hindered the development of taxonomic procedures and made it difficult to identify and share these economically important crop resources. Species identification through DNA barcoding is a valuable taxonomic classification tool. In this study, four DNA barcodes (ITS2, matK, rbcL and psbA-trnH) were evaluated on 110 samples that represented 34 taxonomically best-known species in the Vicia genus. Topologies of the phylogenetic trees based on an individual locus were similar. Individual locus-based analyses could not discriminate closely related Vicia species. We proposed a concatenated data approach to increase the resolving power of ITS2. The DNA barcodes matK, psbA-trnH and rbcL were used as an additional tool for phylogenetic analysis. Among the four barcodes, three-barcode combinations that included psbA-trnH with any two of the other barcodes (ITS2, matK or rbcL) provided the best discrimination among Vicia species. Species discrimination was assessed with bootstrap values and considered successful only when all the conspecific individuals formed a single clade. Through sequencing of these barcodes from additional Vicia accessions, 17 of the 34 known Vicia species could be identified with varying levels of confidence. From our analyses, the combined barcoding markers are useful in the early diagnosis of targeted Vicia species and can provide essential baseline data for conservation strategies, as well as guidance in assembling germplasm collections.


2005 ◽  
Vol 55 (3) ◽  
pp. 1171-1179 ◽  
Author(s):  
Daniel R. Zeigler

Full-length recN and 16S rRNA gene sequences were determined for a collection of 68 strains from the thermophilic Gram-positive genus Geobacillus, members of which have been isolated from geographically and ecologically diverse locations. Phylogenetic treeing methods clustered the isolates into nine sequence similarity groups, regardless of which gene was used for analysis. Several of these groups corresponded unambiguously to known Geobacillus species, whereas others contained two or more type strains from species with validly published names, highlighting a need for a re-assessment of the taxonomy for this genus. For taxonomic analysis of bacteria related at a genus, species or subspecies level, recN sequence comparisons had a resolving power nearly an order of magnitude greater than 16S rRNA gene comparisons. However, mutational saturation rendered recN comparisons much less powerful than 16S rRNA gene comparisons for analysis of higher taxa. Analysis of recN sequences should prove a powerful tool for assigning strains to species within Geobacillus, and perhaps within other genera as well.


2017 ◽  
Author(s):  
Evan McCartney-Melstad ◽  
Müge Gidiş ◽  
H. Bradley Shaffer

Genomic data are useful for attaining high resolution in population genetic studies and have become increasingly available for answering questions in biological conservation. We analyzed RADseq data for the protected foothill yellow-legged frog (Rana boylii) throughout its native range in California and Oregon, including many of the same localities included in an earlier study based on mitochondrial DNA. We recovered five primary clades that correspond to geographic regions within California and Oregon, with better resolution and more spatially consistent patterns than the previous study, confirming the increased resolving power of genomic approaches compared to single-locus analyses. Bayesian clustering, PCA and population differentiation with admixture analyses all indicated that approximately half the range of R. boylii consists of a single, relatively uniform population, while regions in the Sierra Nevada and Central Coast Range of California are deeply differentiated genetically. Additionally, a major methodological challenge for large genome organisms, including many amphibians, is deciding on sequence similarity clustering thresholds for population genetic analyses using RADseq data, and we develop a novel set of metrics that allow researchers to set a sequence similarity threshold that maximizes the separation of paralogous regions while minimizing the oversplitting of naturally occurring allelic variation within loci.


Author(s):  
Huyen-Trang Vu ◽  
Ly Le

Classification of organisms is the primary step in the management of biodiversity, breeding, conservation and development of populations, and in distinguishing adulterants. There are many approaches to taxonomic identification, from morphological and PCR-based to sequence-based techniques. Molecular methods give more accurate results than morphological comparisons and are independent of plant developmental stage. PCR-based methods are low-cost, but the limited information they provide makes them less reproducible, and they can only distinguish samples among predetermined groups. In contrast, sequence-based methods treat each nucleotide site as genetic information; a nucleotide sequence therefore represents a large amount of data that is highly specific and more stable than PCR bands. Establishing a worldwide DNA library for barcoding is essential. Previous reviews have covered the screening and application of barcodes in different taxa. In this review we discuss common bioinformatics analyses as well as some newly improved techniques that rely on barcoding approaches.


2021 ◽  
Author(s):  
Yoonjin Kim ◽  
Zhen Guo ◽  
Jeffrey A. Robertson ◽  
Benjamin Reidys ◽  
Ziyan Zhang ◽  
...  

Biological sequence alignment using computational power has received increasing attention as technology develops. It is important to predict if a novel DNA sequence is potentially dangerous by determining its taxonomic identity and functional characteristics through sequence identification. This task can be facilitated by the rapidly increasing amounts of biological data in DNA and protein databases thanks to the corresponding reduction in computational and storage costs. Unfortunately, the growth in biological databases has caused difficulty in exploiting this information. EnTrance presents an approach that can expedite the analysis of this large database by employing entropy scaling. This allows scaling with the amount of entropy in the database instead of scaling with the absolute size of the database. Since DNA and protein sequences are biologically meaningful, the space of biological sequences demonstrates the structure exploited by entropy scaling. As biological sequence databases grow, taking advantage of this structure can be extremely beneficial for reducing query times. EnTrance, the entropy scaling search algorithm introduced here, accelerates the biological sequence search exemplified by tools such as BLAST. EnTrance does this by utilizing a two-step search approach. In this fashion, EnTrance quickly reduces the number of potential matches before more exhaustively searching the remaining sequences. Tests of EnTrance show that this approach can lead to improved query times. However, constructing the required entropy scaling indices beforehand can be challenging. To improve performance, EnTrance investigates several ideas for accelerating index build time that supports entropy scaling searches. In particular, EnTrance makes full use of the concurrency features of the Go language, greatly reducing the index build time. Our results identify key tradeoffs and demonstrate that there is potential in using these techniques for sequence similarity searches. Finally, EnTrance returns more matches and higher percentage identity matches when compared with existing tools.
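The two-step idea (a coarse pass over cluster representatives, followed by an exhaustive pass inside the surviving clusters) can be sketched in a few lines of Python. This is a toy illustration and not EnTrance's implementation: the greedy clustering, the use of Hamming distance, the assumption that all sequences have equal length, and all function names are simplifications introduced here.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))


def build_index(seqs: List[str], cluster_radius: int) -> Dict[str, List[str]]:
    """Greedy clustering: a sequence joins the first center within
    cluster_radius, otherwise it becomes a new center."""
    index: Dict[str, List[str]] = {}
    for s in seqs:
        for center in index:
            if hamming(s, center) <= cluster_radius:
                index[center].append(s)
                break
        else:
            index[s] = [s]
    return index


def search(query: str, index: Dict[str, List[str]],
           cluster_radius: int, search_radius: int) -> List[str]:
    """Return all indexed sequences within search_radius of the query."""
    hits: List[str] = []
    for center, members in index.items():
        # Coarse step: by the triangle inequality, a cluster can only contain
        # a hit if its center lies within search_radius + cluster_radius.
        if hamming(query, center) <= search_radius + cluster_radius:
            # Fine step: exhaustive comparison inside the surviving cluster.
            hits.extend(m for m in members if hamming(query, m) <= search_radius)
    return hits


seqs = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "CCCCCCCC", "CCCCCCCG", "GGGGGGGG"]
index = build_index(seqs, cluster_radius=1)
print(search("AAAAAAAA", index, cluster_radius=1, search_radius=2))
```

Because the coarse filter relies only on the triangle inequality, it never discards a true hit; the saving comes from skipping whole clusters whose centers are too far from the query.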


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 1006 ◽  
Author(s):  
N. Tessa Pierce ◽  
Luiz Irber ◽  
Taylor Reiter ◽  
Phillip Brooks ◽  
C. Titus Brown

The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
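The MinHash-based sketching described in the abstract can be illustrated with a minimal bottom-k sketch in pure Python. This is not sourmash's API or implementation: the function names, the k-mer size, the sketch size, and the use of SHA-1 as the hash function are assumptions made here so that the example runs with the standard library alone.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq: str, k: int = 5, num: int = 64) -> List[int]:
    """Bottom-k MinHash sketch: the `num` smallest distinct k-mer hashes."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num]


def jaccard_estimate(a: List[int], b: List[int], num: int = 64) -> float:
    """Estimate Jaccard similarity from two bottom-k sketches: take the `num`
    smallest hashes of the merged sketches and count those present in both."""
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


s1 = "ACGTACGTTGCAACGGTA"
s2 = "ACGTACGTTGCAACGGTC"
print(jaccard_estimate(sketch(s1), sketch(s2)))
```

The sketch keeps at most `num` integers per sequence regardless of sequence length, which is what makes comparisons between very large data sets cheap in memory; when a sequence has fewer than `num` distinct k-mers, the estimate coincides with the exact Jaccard similarity of the k-mer sets.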


Genes ◽  
2020 ◽  
Vol 11 (4) ◽  
pp. 445 ◽  
Author(s):  
Adeline Seah ◽  
Marisa C.W. Lim ◽  
Denise McAloose ◽  
Stefan Prost ◽  
Tracie A. Seimon

The ability to sequence a variety of wildlife samples with portable, field-friendly equipment will have significant impacts on wildlife conservation and health applications. However, the only currently available field-friendly DNA sequencer, the MinION by Oxford Nanopore Technologies, has a high error rate compared to standard laboratory-based sequencing platforms and has not been systematically validated for DNA barcoding accuracy for preserved and non-invasively collected tissue samples. We tested whether various wildlife sample types, field-friendly methods, and our clustering-based bioinformatics pipeline, SAIGA, can be used to generate consistent and accurate consensus sequences for species identification. Here, we systematically evaluate variation in cytochrome b sequences amplified from scat, hair, feather, fresh frozen liver, and formalin-fixed paraffin-embedded (FFPE) liver. Each sample was processed by three DNA extraction protocols. For all sample types tested, the MinION consensus sequences matched the Sanger references with 99.29%–100% sequence similarity, even for samples that were difficult to amplify, such as scat and FFPE tissue extracted with Chelex resin. Sequencing errors occurred primarily in homopolymer regions, as identified in previous MinION studies. We demonstrate that it is possible to generate accurate DNA barcode sequences from preserved and non-invasively collected wildlife samples using portable MinION sequencing, creating more opportunities to apply portable sequencing technology for species identification.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11028
Author(s):  
Pilar Soledispa ◽  
Efrén Santos-Ordóñez ◽  
Migdalia Miranda ◽  
Ricardo Pacheco ◽  
Yamilet Irene Gutiérrez Gaiten ◽  
...  

Smilax plants are distributed in tropical, subtropical, and temperate regions in both hemispheres of the world. They are used extensively in traditional medicines in a number of countries. However, morphological and molecular barcode analyses, which may assist in the taxonomic identification of species, are lacking in Ecuador. In order to evaluate the micromorphological characteristics of these plants, cross sections of Smilax purhampuy leaves were obtained manually. The rhizome powder, which is typically used in traditional medicines, was analyzed for micromorphological characteristics. All samples were clarified with 1% sodium hypochlorite. Tissues were colored with 1% safranin in water and were fixed with glycerinated gelatin. DNA was extracted from the leaves using a modified CTAB method for molecular barcode characterization and PCR was performed using primers to amplify the different loci including the plastid genome regions atpF-atpH spacer, matK gene, rbcL gene, rpoB gene, rpoC1 gene, psbK–psbI spacer, and trnH–psbA spacer; and the nuclear DNA sequence ITS2. A DNA sequence similarity search was performed using BLAST in the GenBank nr database and phylogenetic analysis was performed using the maximum likelihood method according to the best model identified by MEGAX using a bootstrap test with 1,000 replicates. Results showed that the micromorphological evaluation of a leaf cross section depicted a concave arrangement of the central vein, which was more pronounced in the lower section and had a slight protuberance. The micromorphological analysis of the rhizome powder allowed the visualization of a group of cells with variable sizes in the parenchyma and revealed thickened xylematic vessels associated with other elements of the vascular system. Specific amplicons were detected in DNA barcoding for all the barcodes tested except for the trnH–psbA spacer. BLAST analysis revealed that the Smilax species was predominant in all the samples for each barcode; therefore, the genus Smilax was confirmed through DNA barcode analysis. The barcode sequences psbK-psbI, atpF-atpH, and ITS2 had a better resolution at the species level in phylogenetic analysis than the other barcodes we tested.


2020 ◽  
Vol 21 (8) ◽  
Author(s):  
Viet The Ho ◽  
Minh Phuong Nguyen

Abstract. Ho VT, Nguyen MP. 2020. An in silico approach for evaluation of rbcL and matK loci for DNA barcoding of Cucurbitaceae family. Biodiversitas 21: 3879-3885. DNA barcodes have been used intensively to discriminate different species in the Cucurbitaceae family. The main aim of this study is to evaluate the effectiveness of the rbcL and matK loci for 16 species of the Cucurbitaceae family using an in silico approach. For the analysis, sequences were first retrieved from NCBI and their sequence parameters were calculated. The sequences were then aligned, a phylogenetic tree was constructed, and species resolution ability was examined. The obtained data show that resolving capacity varies among species. The rbcL region is suitable for distinguishing five species, namely S. edule, M. cochinchinensis, L. aegyptiaca, C. melo, and C. pepo, whereas the matK locus is more suitable for a different set of five species: M. balsamina, M. cochinchinensis, M. charantia, S. edule, and C. sativus. The resolving power improves sharply when the rbcL + matK combination is analyzed, distinguishing up to nine species: C. lanatus, B. hispida, C. melo, C. sativus, C. pepo, C. argyrosperma, L. aegyptiaca, S. edule, and M. cochinchinensis. Therefore, the integration of the rbcL and matK loci may improve the assessment of genetic relatedness at the species level among members of the Cucurbitaceae family. The obtained information could be important for choosing proper DNA barcode loci for phylogenetic study of this crop family.

