TaxonKit: a practical and efficient NCBI Taxonomy toolkit

Author(s):  
Wei Shen ◽  
Hong Ren
Keyword(s):  
Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Conrad L Schoch ◽  
Stacy Ciufo ◽  
Mikhail Domrachev ◽  
Carol L Hotton ◽  
Sivakumar Kannan ◽  
...  

Abstract The National Center for Biotechnology Information (NCBI) Taxonomy includes organism names and classifications for every sequence in the nucleotide and protein sequence databases of the International Nucleotide Sequence Database Collaboration. Since the last review of this resource in 2012, it has undergone several improvements. Most notable is the shift from a single SQL database to a series of linked databases tied to a framework of data called NameBank. This means that relations among data elements can be adjusted in more detail, resulting in expanded annotation of synonyms, the ability to flag names with specific nomenclatural properties, enhanced tracking of publications tied to names and improved annotation of scientific authorities and types. Additionally, practices utilized by NCBI Taxonomy curators specific to major taxonomic groups are described, terms peculiar to NCBI Taxonomy are explained, external resources are acknowledged and updates to tools and other resources are documented. Database URL: https://www.ncbi.nlm.nih.gov/taxonomy


2007 ◽  
Author(s):  
Αίγλη Παπαθανασοπούλου

Two technical improvements to the equally weighted maximum parsimony method, MP6-5 and MP6-4, are proposed for inferring evolutionary trees; they address the homoplasy of misleading sites. Determining the number of informative and non-informative sites, together with the relevant mathematical proofs, makes it possible to analyze the behavior of the equally weighted parsimony method over the full set of possible informative sites for 4, 5 or 6 evolutionary units. It is found that, as the number N of evolutionary units increases, the share of examined trees that are subject to the homoplasy problem also increases. In addition, a new approach to multiple alignment of cDNA sequences is proposed, based on structural alignment of the three-dimensional structure of the proteins. In this thesis, the effectiveness of the proposed improvements MP6-5 and MP6-4 is evaluated against that of the equally weighted parsimony method (MP) (i) on artificial data and (ii) on real data. The results show that: 1) on simulated DNA nucleotide sequences 300-1200 bases long, (i) MP6-5 is better than MP and (ii) MP6-4 is better than MP when evolution has occurred symmetrically. 2) on comparative restriction site maps, the improvements MP6-5 and MP6-4 are better than the MP method. 3) on real data of aligned cDNA sequences from four enzyme families, SOD, DHFR, TIM and ADH, from evolutionarily distant organisms: • the improvements MP6-5 and MP6-4 and the MP method propose the same topologies as most parsimonious • in the three enzyme families SOD, TIM and ADH, the most parsimonious trees proposed by all of the applied algorithms are the same as the evolutionary trees indicated by the NCBI taxonomy. 
In the DHFR family the proposed evolutionary tree does conflict with that of the NCBI taxonomy, but the conflict is limited to a single evolutionary branch (the grouping of an archaebacterium with bacteria). The above indicates that the new approach to multiple alignment of nucleotide sequences provides reliable data for evolutionary studies.


Zootaxa ◽  
2019 ◽  
Vol 4706 (3) ◽  
pp. 401-407 ◽  
Author(s):  
AKHIL GARG ◽  
DETLEF LEIPE ◽  
PETER UETZ

We compared the species names in the Reptile Database, a dedicated taxonomy database, with those in the NCBI taxonomy database, which provides the taxonomic backbone for the GenBank sequence database. About 67% of the known ~11,000 reptile species are represented with at least one DNA sequence and a binary species name in GenBank. However, a common problem arises through the submission of preliminary species names (such as “Pelomedusa sp. A CK-2014”) to GenBank and thus the NCBI taxonomy. These names cannot be assigned to any accepted species names and thus create a disconnect between DNA sequences and species. While these names of unknown taxonomic meaning sometimes get updated, often they remain in GenBank which now contains sequences from ~1,300 such “putative” reptile species tagged by informal names (~15% of its reptile names). We estimate that NCBI/GenBank probably contain tens of thousands of such “disconnected” entries. We encourage sequence submitters to update informal species names after they have been published, otherwise the disconnect will cause increasing confusion and possibly misleading taxonomic conclusions.
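The distinction drawn above between binary species names and informal names such as “Pelomedusa sp. A CK-2014” can be sketched as a simple filter. This is a minimal illustration, not the procedure the authors used: the function name `is_formal_binomial` and the marker list (`sp.`, `cf.`, `aff.`) are assumptions chosen for the example, and real curation needs more rules than this.

```python
import re

# Assumed markers of informal or provisional names; illustrative only,
# not the rules applied by NCBI or the Reptile Database.
INFORMAL_MARKERS = re.compile(r"\b(sp|cf|aff)\.(\s|$)")


def is_formal_binomial(name: str) -> bool:
    """Return True if the name looks like a plain 'Genus species' binomial."""
    if INFORMAL_MARKERS.search(name):
        return False
    parts = name.split()
    # Exactly two words: a capitalized genus and a lowercase epithet.
    return (
        len(parts) == 2
        and parts[0][:1].isupper()
        and parts[0][1:].islower()
        and parts[1].islower()
        and parts[1].replace("-", "").isalpha()
    )


if __name__ == "__main__":
    for candidate in ["Pelomedusa subrufa", "Pelomedusa sp. A CK-2014"]:
        print(candidate, is_formal_binomial(candidate))
```

A filter like this only separates names by shape; it cannot tell whether a well-formed binomial is an accepted species name.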


Author(s):  
Takeru Nakazato

DNA barcoding is now widely used by biodiversity and molecular biology researchers to identify species and analyze their phylogeny. Recently, DNA metabarcoding and environmental DNA (eDNA) technologies have been developed by extending the concept of DNA barcoding. These techniques analyze the diversity and quantity of organisms within an environment by detecting biogenic DNA in water and soil. They are particularly popular for monitoring fish species living in rivers and lakes (Takahara et al. 2012). BOLD Systems (Barcode of Life Database systems, Ratnasingham and Hebert 2007) is a database for DNA barcoding that archives 8.5 million barcodes (as of August 2020) along with the voucher specimen from which each barcode sequence is derived, with taxonomy, country of collection, and the museum holding the voucher as metadata (e.g. https://www.boldsystems.org/index.php/Public_RecordView?processid=TRIBS054-16). Many barcoding data are also submitted to GenBank (Sayers et al. 2020), a database of DNA sequences managed by NCBI (National Center for Biotechnology Information, US). The number of DNA barcode records, i.e. COI (cytochrome c oxidase I) gene records for animals, has grown significantly (Porter and Hajibabaei 2018). BOLD imports DNA barcoding data from GenBank, and many DNA barcoding records in GenBank are also assigned BOLD IDs. However, we have to refer to both BOLD and GenBank data when performing DNA barcoding. I have previously investigated the registration of DNA barcoding data in GenBank, especially the association with BOLD, using insects and flowering plants as examples (Nakazato 2019). Here, I surveyed the number of species covered by BOLD and GenBank. I used fish data as an example because eDNA research is particularly focused on fish. I downloaded all GenBank files for vertebrates from NCBI FTP (File Transfer Protocol) sites (as of November 2019). Of the GenBank fish entries, 86,958 (7.3%) were assigned BOLD identifiers (IDs). 
The NCBI taxonomy database has registrations for 39,127 species of fish, and 20,987 scientific names at the species level (i.e., excluding names that included sp., cf. or aff.). GenBank entries with BOLD IDs covered 11,784 species (30.1%) and 8,665 species-level names (41.3%). I also obtained the whole "specimens and sequences combined data" for fish from BOLD Systems (as of November 2019). In BOLD, there are 273,426 entries that are registered as fish. Of these entries, 211,589 were assigned GenBank IDs, i.e. had values in the “genbank_accession” column, and 121,748 were imported from GenBank, i.e. had the "Mined from GenBank, NCBI" description in the "institution_storing" column. The BOLD data covered 18,952 fish species and 15,063 species-level names, but 35,500 entries were assigned no species-level names and 22,123 entries lacked even family-level names. At the species level, 8,067 names co-occurred in GenBank and BOLD, with 6,997 BOLD-specific names and 599 GenBank-specific names. GenBank has 425,732 fish entries with voucher IDs, of which 340,386 were not assigned a BOLD ID. Of these 340,386 entries, 43,872 are registrations for COI genes, which could be candidates for DNA barcodes. These candidates include 4,201 species that are not included in BOLD, thus adding these data will enable us to identify 19,863 fish to the species level. For researchers, it would be very useful if both BOLD and GenBank DNA barcoding data could be searched in one place. For this purpose, it is necessary to integrate data from the two databases. A lot of biodiversity data are recorded based on the Darwin Core standard while DNA sequencing data are sometimes integrated or cross-linked by RDF (Resource Description Framework). 
It may not be technically difficult to integrate these data, but the two databases reference different species data, the EoL (The Encyclopedia of Life) for BOLD and the NCBI taxonomy for GenBank, and the differences in taxonomic systems make it difficult to match records by scientific name. GenBank has fields for the latitude and longitude of the specimens sampled, and Porter and Hajibabaei 2018 argue that this information should be enhanced. However, this information may be better described in the specimen and occurrence databases. The integration of barcoding data with the specimen and occurrence data will solve these problems. Most importantly, it will save the researcher from having to register the same information in multiple databases. In the field of biodiversity, DNA barcodes may have been the only gene sequences to receive attention and use. The museomics community regards museum-preserved specimens as rich resources for DNA studies because their biodiversity information can accompany the extraction and analysis of their DNA (Nakazato 2018). GenBank is useful for biodiversity studies due to its low rate of mislabelling (Leray et al. 2019). In the future, we will be working with a variety of DNA, including genomes from museum specimens as well as DNA barcoding. This will require more integrated use of biodiversity information and DNA sequence data. This integration is also of interest to molecular biologists and bioinformaticians.
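The coverage percentages quoted in this abstract can be reproduced from the reported counts. The sketch below is only a check of that arithmetic; the helper name `coverage_percent` is introduced here for illustration, and the counts are copied from the abstract rather than recomputed from GenBank or BOLD.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Return covered/total as a percentage rounded to one decimal place."""
    return round(100 * covered / total, 1)


# Counts as reported in the abstract (as of November 2019).
ncbi_fish_species = 39_127          # fish species registered in NCBI taxonomy
ncbi_species_level_names = 20_987   # names without sp., cf. or aff.
bold_linked_species = 11_784        # species with GenBank entries carrying BOLD IDs
bold_linked_names = 8_665           # species-level names with such entries

if __name__ == "__main__":
    print(coverage_percent(bold_linked_species, ncbi_fish_species))       # 30.1
    print(coverage_percent(bold_linked_names, ncbi_species_level_names))  # 41.3
```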


Author(s):  
Marcos Zárate ◽  
Paula Zermoglio ◽  
John Wieczorek ◽  
Anabela Plos ◽  
Renato Mazzanti

Scientists frequently collect biological and environmental information over years and store it in database systems to answer their own research questions without exposing it in repositories that make it easy to find and retrieve. While in recent years the community working on biodiversity informatics has made significant strides by creating common shared vocabularies such as the Darwin Core (DwC, Wieczorek et al. 2012) and publishing mechanisms such as the Integrated Publishing Toolkit (IPT, Robertson et al. 2014), integration is largely limited to the aggregation of datasets and full interoperability has still not been achieved. In this context, the Semantic Web (SW) aims to represent information in a way that, in addition to human-centered display purposes, can be used autonomously by machines for integration and reuse across applications. From the biodiversity informatics point of view, interoperability and links among data sources would allow integration of information that is otherwise disconnected, enabling scientists to answer broader questions. These considerations provide strong motivations to formulate a web application considering the semantic interoperability that may provide answers to questions such as the following: (Q1) Is it possible to complement taxonomic, bibliographic and environmental information of a particular species without relying on specific Application Programming Interfaces (APIs)? (Q2) How to relate occurrences of species with environmental variables within a specific region? (Q3) What are the bibliographic references associated with a given species? 
With questions such as these in mind, we present the design of a proof-of-concept application: Linked Open Biodiversity Data (LOBD). LOBD uses Linked Data (LD) (Heath and Bizer 2011) to complement species occurrence information previously extracted from GBIF and converted to Resource Description Framework (RDF) (Zárate et al. 2020) with information about the taxa in question from different RDF datasets, such as Wikidata, NCBI Taxonomy, Springer Nature SciGraph and OpenCitation corpus. A simplified view of the architecture is shown in Fig. 1. To achieve semantic interoperability, we use the SPARQL query language, which allows us not to depend on specific APIs to retrieve information. The application consists of three modules: (1) General information, where the Wikidata endpoint is used to retrieve additional information about the selected species, including links to other databases and information about the species extracted from National Center for Biotechnology Information (NCBI) Taxonomy; (2) Bibliography, where all publications related to the species are retrieved and extracted from OpenCitation; and (3) Environment, where users can plot species on a map and add layers related to marine regions as well as environmental layers (e.g., temperature, salinity, etc). 
For the development of the application, we use the Shiny framework for R; access to SPARQL endpoints is done through the SPARQL package, marine regions are obtained from marineregion.org, and the environmental layers are extracted from Bio-ORACLE. The data used for this article were collected by the Center for the Study of Marine Systems at the National Patagonian Sci-Tech Centre (CCT CENPAT-CONICET), and are published and available through the GBIF network. Linked Data is a powerful tool for scientists, as it allows generating new approaches to biodiversity informatics, which can help to address the data integration challenges. Users would benefit from complementing the current prevalent use of vocabularies that are not ontologically defined (like DwC) for sharing biodiversity data. Although this application is a proof of concept, it shows that with little effort, it is possible to achieve greater interoperability between datasets that were not initially represented as LD.
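The "General information" module described above retrieves species details from the Wikidata endpoint with SPARQL. A rough sketch of that kind of lookup follows, written in Python rather than the R and Shiny stack the authors report using. It is not the LOBD code: the function names and the example species are hypothetical, and the query relies only on the public Wikidata properties P225 (taxon name) and P685 (NCBI taxonomy ID). The block builds the query and request URL without sending anything over the network.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_taxon_query(scientific_name: str) -> str:
    """Build a SPARQL query that maps a taxon name to its NCBI taxonomy ID."""
    # Escape backslashes and double quotes so the name is a valid string literal.
    escaped = scientific_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?taxon ?ncbiId WHERE {\n"
        f'  ?taxon wdt:P225 "{escaped}" ;\n'   # P225: taxon name
        "         wdt:P685 ?ncbiId .\n"        # P685: NCBI taxonomy ID
        "}"
    )


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Hypothetical example species, chosen only for illustration.
    query = build_taxon_query("Gadus morhua")
    print(query)
    print(build_request_url(query))
```

Because the lookup is expressed as a query against a standard endpoint, the same pattern can be pointed at another RDF dataset by changing the endpoint and properties, which is the API independence the abstract describes.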


1999 ◽  
Vol 36 (09) ◽  
pp. 36-5071-36-5071

2014 ◽  
Vol 43 (D1) ◽  
pp. D1086-D1098 ◽  
Author(s):  
Scott Federhen

2017 ◽  
Author(s):  
Eneida Hatcher ◽  
Yiming Bao ◽  
Paolo Amedeo ◽  
Olga Blinkova ◽  
Guy Cochrane ◽  
...  

Currently the National Center for Biotechnology Information (NCBI) assigns individual taxonomy identifiers to each distinct influenza virus isolate submitted to GenBank. To support this practice, individual flu isolates must be manually added to the NCBI taxonomy database and unique taxonomy identifiers generated. This added layer of manual processing is unique to influenza virus and prevents automation of the flu sequence submission process. Here we outline a new NCBI policy that normalizes influenza virus taxonomy processing but maintains features supported by the previous approach. This change will reduce the amount of manual handling necessary for flu submissions and pave the way for increased automation of the submissions process. While this automation may disrupt some historic practices, it will better align influenza virus data processing with other viruses and ultimately lower the submission burden on data providers.


2011 ◽  
Vol 40 (D1) ◽  
pp. D136-D143 ◽  
Author(s):  
S. Federhen

2014 ◽  
Vol 9 (3) ◽  
pp. 1275-1277 ◽  
Author(s):  
Scott Federhen ◽  
Karen Clark ◽  
Tanya Barrett ◽  
Helen Parkinson ◽  
James Ostell ◽  
...  
