Assessment of current taxonomic assignment strategies for metabarcoding eukaryotes

Author(s):  
Jose S. Hleap ◽  
Joanne E. Littlefair ◽  
Dirk Steinke ◽  
Paul D. N. Hebert ◽  
Melania E. Cristescu

ABSTRACT The effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to a generic or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging. Researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities that vary in species number and specimen abundance, while holding upstream molecular and bioinformatic variables constant. It also evaluates the impact of parameter optimization on the quality of the predictions. Despite the general belief that BLAST top hit underperforms newer methods, our results indicate that it competes well with more complex approaches if optimized for the mock community under study. For example, the two machine learning methods that were benchmarked proved more sensitive to the reference database heterogeneity and completeness than methods based on sequence similarity. The accuracy of assignments was impacted by both species and specimen counts, which will influence the selection of appropriate software. We urge the use of realistic mock communities to allow optimization of parameters, regardless of the taxonomic assignment method used.
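The "BLAST top hit" strategy with a tunable parameter can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the `Hit` record, the `top_hit_assignments` function, and the 97% default identity cutoff are all assumptions made for illustration, and the hits are assumed to have been parsed from tabular BLAST output beforehand.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    """One parsed alignment hit (hypothetical record layout)."""
    query: str       # query sequence identifier
    taxon: str       # taxon label of the reference sequence
    pident: float    # percent identity of the alignment
    bitscore: float  # alignment bit score


def top_hit_assignments(
    hits: Iterable[Hit], min_identity: float = 97.0
) -> Dict[str, Optional[str]]:
    """Assign each query the taxon of its best-scoring hit.

    A query whose best hit falls below `min_identity` is left
    unassigned (None). The cutoff is the kind of parameter that
    could be tuned against a mock community; 97.0 is an assumed
    default, not a recommended value.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return {
        query: (hit.taxon if hit.pident >= min_identity else None)
        for query, hit in best.items()
    }
```

Raising or lowering `min_identity` trades unassigned sequences against wrong assignments, which is why a cutoff of this kind would need to be checked against a community of known composition.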

2020 ◽  
Vol 36 (18) ◽  
pp. 4699-4705
Author(s):  
Hamid Bagheri ◽  
Andrew J Severin ◽  
Hridesh Rajan

Abstract Motivation: As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission, a process that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity. Results: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability and implementation: Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr. Supplementary information: Supplementary data are available at Bioinformatics online.
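The precision and recall figures quoted above follow the standard definitions, which can be sketched in a few lines of Python. This is a generic illustration, not the authors' evaluation code; the function name `precision_recall` and the toy protein identifiers in the usage note are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(
    flagged: Set[str], truly_misclassified: Set[str]
) -> Tuple[float, float]:
    """Return (precision, recall) for a set of flagged items.

    precision = correctly flagged / all flagged
    recall    = correctly flagged / all truly misclassified
    An empty denominator yields 0.0 instead of raising an error.
    """
    true_positives = len(flagged & truly_misclassified)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = (
        true_positives / len(truly_misclassified)
        if truly_misclassified
        else 0.0
    )
    return precision, recall
```

For instance, flagging the four identifiers `{"p1", "p2", "p3", "p4"}` when the truly misclassified set is `{"p1", "p2", "p3", "p5", "p6"}` gives three true positives, so precision is 3/4 and recall is 3/5.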


2020 ◽  
Author(s):  
Hamid Bagheri ◽  
Robert Dyer ◽  
Andrew Severin ◽  
Hridesh Rajan

Abstract Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation, and taxonomic assignment from NCBI’s NR database were placed into a BoaG database, a domain-specific language and shared data science infrastructure for genomics, along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the non-redundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web-interface of the BoaG infrastructure can be accessed here: http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to login. Source code and other documentation are also provided as a GitHub repository: https://github.com/boalang/NR_Dataset.


Author(s):  
N. B. Istomina ◽  
O. V. Likhacheva ◽  

The article describes the results of studies of the factors affecting lichen diversity in 46 manor parks of the Pskov region. The investigated parks were founded between the end of the 18th and the beginning of the 20th century. Twenty of them are fragmented and currently occupy less than 5 ha, the area of nine parks varies from 5 to 10 ha, and those preserved within historical boundaries cover from 11 to 100 ha. Manor parks are situated both within the settlements’ boundaries (31 parks) and outside the settlements, bordering either forest (12) or agricultural lands (3). Ten of the former border the forests. During the study 166 lichen species were identified. Statistical methods were applied to investigate the factors affecting lichen diversity in manor parks. Linear regression analysis was used to examine the dependence of the lichen species number on the park age/date of park creation (scatter plot), area of the parks (box plot), substrata diversity (scatter plot), and the dependence of the epiphytic lichen species number on tree and shrub species diversity (scatter plot). The biserial correlation coefficient was used to assess the impact of the settlement and of the presence of the surrounding natural forests. Correlation analysis was performed to examine the relationship between the lichen species composition of parks located in different subzones of the forest zone in the Pskov region. Our findings show that the number of lichen species depends on the park area (p = 0.0315), the variety of substrate types (p < 0.001), and the variety of trees and bushes planted (p < 0.001). The date of park creation and the presence of the surrounding natural forests do not influence the species diversity of lichens. We reveal that the location of the parks in a specific subzone of the forest zone (southern taiga and mixed coniferous-broad-leaved forest) has no significant effect on the lichen species composition.
The species richness of lichens tends to decrease in parks located within the settlements. The data obtained indicate not only the similarity of the species composition of lichens in the studied communities, but also the long-term development of lichen park communities in comparable climatic and landscape conditions.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11865
Author(s):  
Dylan Catlett ◽  
Kevin Son ◽  
Connie Liang

Background High-throughput sequencing of phylogenetically informative marker genes is a widely used method to assess the diversity and composition of microbial communities. Taxonomic assignment of sampled marker gene sequences (referred to as amplicon sequence variants, or ASVs) imparts ecological significance to these genetic data. To assign taxonomy to an ASV, a taxonomic assignment algorithm compares the ASV to a collection of reference sequences (a reference database) with known taxonomic affiliations. However, many taxonomic assignment algorithms and reference databases are available, and the optimal algorithm and database for a particular scientific question is often unclear. Here, we present the ensembleTax R package, which provides an efficient framework for integrating taxonomic assignments predicted with any number of taxonomic assignment algorithms and reference databases to determine ensemble taxonomic assignments for ASVs. Methods The ensembleTax R package relies on two core algorithms: taxmapper and assign.ensembleTax. The taxmapper algorithm maps taxonomic assignments derived from one reference database onto the taxonomic nomenclature (a set of taxonomic naming and ranking conventions) of another reference database. The assign.ensembleTax algorithm computes ensemble taxonomic assignments for each ASV in a data set based on any number of taxonomic assignments determined with independent methods. Various parameters allow analysts to prioritize obtaining either more ASVs with more predicted clade names or more robust clade name predictions supported by multiple independent methods in ensemble taxonomic assignments. Results The ensembleTax R package is used to compute two sets of ensemble taxonomic assignments for a collection of protistan ASVs sampled from the coastal ocean. 
Comparisons of taxonomic assignments predicted by individual methods with those predicted by ensemble methods show that conservative implementations of the ensembleTax package minimize disagreements between taxonomic assignments predicted by individual and ensemble methods, but result in ASVs with fewer ranks assigned taxonomy. Less conservative implementations of the ensembleTax package result in an increased fraction of ASVs classified at all taxonomic ranks, but increase the number of ASVs for which ensemble assignments disagree with those predicted by individual methods. Discussion We discuss how implementation of the ensembleTax R package may be optimized to address specific scientific objectives based on the results of the application of the ensembleTax package to marine protist communities. While further work is required to evaluate the accuracy of ensemble taxonomic assignments relative to taxonomic assignments predicted by individual methods, we also discuss scenarios where ensemble methods are expected to improve the accuracy of taxonomy prediction for ASVs.
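The trade-off between conservative and less conservative ensemble assignments can be illustrated with a simplified Python sketch. It does not reproduce the ensembleTax R package or its `taxmapper` and `assign.ensembleTax` algorithms: the function name `ensemble_assignment`, the `min_agree` parameter, and the rank-by-rank agreement rule are assumptions chosen for illustration, and all input assignments are assumed to already share one taxonomic nomenclature.

```python
from collections import Counter
from typing import List, Optional, Sequence


def ensemble_assignment(
    assignments: Sequence[Sequence[Optional[str]]], min_agree: int
) -> List[Optional[str]]:
    """Combine per-method assignments for one ASV, rank by rank.

    `assignments` holds one list of clade names per method, ordered
    from the highest rank to the lowest, with None for unassigned
    ranks. A rank receives a name only if the most frequent name is
    supported by at least `min_agree` methods and is not tied with
    another name. Assignment stops at the first rank without such
    support, so all lower ranks stay None.
    """
    n_ranks = len(assignments[0])
    consensus: List[Optional[str]] = [None] * n_ranks
    for rank in range(n_ranks):
        names = [a[rank] for a in assignments if a[rank] is not None]
        if not names:
            break
        counts = Counter(names).most_common()
        top_name, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        if top_count < min_agree or tied:
            break
        consensus[rank] = top_name
    return consensus
```

In this sketch a high `min_agree` plays the role of a conservative setting (fewer ranks named, each backed by more methods), while a low value names more ranks at the cost of more disagreement with individual methods.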


2015 ◽  
Vol 59 (4) ◽  
pp. 2113-2121 ◽  
Author(s):  
U. Malik ◽  
O. N. Silva ◽  
I. C. M. Fensterseifer ◽  
L. Y. Chan ◽  
R. J. Clark ◽  
...  

ABSTRACT Staphylococcus aureus is a virulent pathogen that is responsible for a wide range of superficial and invasive infections. Its resistance to existing antimicrobial drugs is a global problem, and the development of novel antimicrobial agents is crucial. Antimicrobial peptides from natural resources offer potential as new treatments against staphylococcal infections. In the current study, we have examined the antimicrobial properties of peptides isolated from anuran skin secretions and cyclized synthetic analogues of these peptides. The structures of the peptides were elucidated by nuclear magnetic resonance (NMR) spectroscopy, revealing high structural and sequence similarity with each other and with sunflower trypsin inhibitor 1 (SFTI-1). SFTI-1 is an ultrastable cyclic peptide isolated from sunflower seeds that has subnanomolar trypsin inhibitory activity, and this scaffold offers pharmaceutically relevant characteristics. The five anuran peptides were nonhemolytic and noncytotoxic and had trypsin inhibitory activities similar to that of SFTI-1. They demonstrated weak in vitro inhibitory activities against S. aureus, but several had strong antibacterial activities against S. aureus in an in vivo murine wound infection model. pYR, an immunomodulatory peptide from Rana sevosa, was the most potent, with complete bacterial clearance at 3 mg·kg⁻¹. Cyclization of the peptides improved their stability but was associated with a concomitant decrease in antimicrobial activity. In summary, these anuran peptides are promising as novel therapeutic agents for treating infections from a clinically resistant pathogen.


2018 ◽  
Vol 23 (1) ◽  
pp. 1 ◽  
Author(s):  
Colin K. C. Wen ◽  
Li-Shu Chen ◽  
Kwang-Tsao Shao

Spatial and temporal variations in the species composition of assemblages are common in many marine organisms, including fishes. Variations in the fish species composition of subtidal coral reefs have been well documented; however, much less is known about such differences for intertidal fish assemblages. This is surprising, given that intertidal fishes are more vulnerable to terrestrial human disturbances. It is critical to evaluate the ecology and biology of intertidal fishes before they are severely impacted by coastal development, especially in developing countries such as those in the tropical western Pacific region where coastal development is rapidly increasing. In this study, we investigated the species composition, abundance, biomass and species number (richness) for intertidal fish assemblages in subtropical (northern) and tropical (southern) Taiwan across four seasons by collecting fishes from tidepools using clove oil. We also examined the gut contents of collected fishes to identify their trophic functional groups in order to investigate regional and seasonal variations for different trophic groups. We found significant differences in the species composition of tidepool fish assemblages between subtropical and tropical Taiwan. Bathygobius fuscus, Abudefduf vaigiensis and Istiblennius dussumieri were dominant species in subtropical Taiwan, whereas Bathygobius coalitus, Abudefduf septemfasciatus and Istiblennius lineatus were dominant in tropical Taiwan. Other species such as Bathygobius cocosensis, Abudefduf sordidus and Istiblennius edentulus were common in both regions. For trophic groups, omnivores and detritivores had, or showed trends towards, higher species numbers and abundances in the subtropical region, whereas herbivores, planktivores and general carnivores had, or showed trends towards, higher species numbers and biomass in the tropical region.
Overall, many intertidal fish species and trophic groups showed differences in abundance, biomass and species number between subtropical and tropical Taiwan. Further studies on large scale geographical gradients in trophic groups and species compositions in the Indo-west Pacific region are encouraged to assist with ecosystem monitoring and assessment. Keywords: Intertidal fishes, spatio-temporal pattern, feeding guild, diet


Biologia ◽  
2017 ◽  
Vol 72 (7) ◽  
Author(s):  
Mária Petrášová-Šibíková ◽  
Igor Matečný ◽  
Eva Uherčíková ◽  
Peter Pišút ◽  
Silvia Kubalová ◽  
...  

Abstract Human alteration of watercourses is a global phenomenon that has had significant impacts on local ecosystems and the services they provide. Monitoring of abiotic and biotic changes is essential to mitigating long-lasting effects, and the 23-year dataset from the Gabčíkovo Waterworks provided a rare opportunity to assess the impact of groundwater regimes on vegetation. The main aim of this study was to describe the effect of the Gabčíkovo Waterworks on vegetation structure and species composition of the adjacent riparian floodplain forests over the past 23 years. The results are based on studies of three permanent monitoring plots (PMPs) located in the Danube inland delta – two outside (PMP 1 and 3) and one (PMP 2) fully under the influence of the artificial supply system. Our results demonstrate that the Danube inland delta was negatively affected by the Gabčíkovo construction, particularly for sites outside of the artificial supply system. There was a significant decrease in soil moisture and increase in nitrogen at both external PMPs (1 and 3). Altered soil conditions were accompanied by negative changes in plant species composition demonstrated by decreases in the number of typical floodplain forest species that are characteristic for the alliance


Plant Disease ◽  
2013 ◽  
Vol 97 (4) ◽  
pp. 562-562 ◽  
Author(s):  
K. Hamed ◽  
W. Menzel ◽  
M. E. Mohamed ◽  
K. A. Bakheet ◽  
S. Winter

Garlic (Allium sativum L.) is one of the most important vegetable field crops in Sudan, cultivated on an area of more than 6,000 ha with a total yield of 27,000 t in 2010 (faostat.fao.org). As part of a project which started in 2010 to improve garlic production in Sudan, samples from local varieties showing severe mosaic and/or mottling were collected in winter 2011 from the main production areas in River Nile State, Northern State, and Darfur State. The plant material used for garlic production came from Sudan and was not imported. Because no reliable data were available on which viruses occur in garlic in Sudan, specific tests were initially omitted. In order to get an overview of the viruses present, dsRNA was prepared from a mixed leaf sample (12 leaves of different samples). This resulted in a high molecular weight dsRNA of approximately 9 kbp that served as template for a random RT-PCR followed by cloning and sequencing (3). Three identical clones originating from one PCR product covering the C-terminal part of the coat protein to the N-terminal part of the nucleic acid binding protein showed the highest sequence similarity to Garlic common latent virus (GarCLV). The nucleotide sequence identities of the 554-bp insert ranged from 85% with an isolate from India (Accession No. FJ154841) up to 97% with a GarCLV isolate from The Netherlands (AB004804), identifying the virus as a Sudanese isolate of GarCLV, one of the most common garlic-infecting viruses. GarCLV belongs to the genus Carlavirus (1) and has previously been reported from Asia, Europe, and South America ( http://sdb.im.ac.cn/vide/descr352.htm ). In order to confirm these results, a double antibody sandwich (DAS)-ELISA was performed with six individual garlic samples, of which five showed a clear reaction with a GarCLV specific antiserum (AS-0230, DSMZ, Germany).
The occurrence of GarCLV could be further confirmed for the ELISA positive samples by a specific RT-PCR using the primers published by Majumder and Baranwal (2). Fragments of the expected size were obtained for all five samples. In addition, one of the positive samples was examined by electron microscopy (Dr. K. Richert-Pöggeler, JKI Braunschweig); filamentous flexuous particles typical for carlaviruses could be observed. The random RT-PCR sequence obtained in this study has been submitted to GenBank (KC013030). To our knowledge, this is the first report of GarCLV in garlic in Sudan and Africa. The impact of GarCLV on garlic production in Sudan needs to be evaluated, but the awareness of the occurrence of the virus and the availability of a reliable diagnostic tool will help to select virus-free propagation material. This will form the basis for sustainable garlic production. References: (1) A. M. Q. King et al. Virus Taxonomy 924, 2012. (2) S. Majumder and V. K. Baranwal. Plant Dis. 93:106, 2009. (3) W. Menzel et al. Arch. Virol. 154:1343, 2009.

