Humans have become a major factor in reshaping the Earth’s biosphere. One of the
main effects of human-driven environmental change is an increase in the rate of
species extinction relative to background rates. Biodiversity hotspots are areas
whose species assemblages are very rich (50% of the world’s plants and 42% of land
vertebrates) yet very threatened with extinction (>70% habitat destruction), and
which ought to be foci for conservation efforts. The intense peril facing the flora
of these endangered regions requires an equally intense response from the scientific
community. This study investigated the benefits of adding genomic information to voucher
specimens to alleviate the Linnaean (lack of species description), Wallacean (lack of data
on species distribution) and Darwinian (lack of data on species evolution) shortfalls.
An open-source R bioinformatic pipeline was developed to determine the percentage of vascular
plant species present in biodiversity hotspots with at least one reproducible DNA sequence
deposited in GenBank. Reproducible DNA sequences were defined as those underpinned by traceable
material and methods and accurate taxonomic identifications. A vascular plant species checklist
for the 36 biodiversity hotspots was inferred using 32,914,892 GBIF occurrences, comprising 204,044
species. A total of 736,532 GenBank accessions (representing DNA barcodes) were downloaded for
those species. Associated abstracts and metadata were mined from 3,127 publications indexed in
PubMed to assess DNA sequence reproducibility. The reproducibility of each study was tested with a
sentiment analysis (a natural language processing technique).
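
The core metric described above, the percentage of checklist species with at least one reproducible sequence, can be illustrated with a minimal Python sketch. This is not the study's R pipeline: the function name, the input layout, and the placeholder hotspot and species labels are all assumptions made for illustration, and the reproducibility flag is taken as already assigned by an upstream step.

```python
def reproducible_coverage(checklist, accessions):
    """Percentage of each hotspot's species with >= 1 reproducible sequence.

    checklist:  dict mapping a hotspot name to a set of species names
                (hypothetical layout, e.g. derived from occurrence records).
    accessions: iterable of (species, is_reproducible) pairs, one per
                sequence accession (flag assumed to be precomputed).
    """
    # A species counts as covered as soon as one of its accessions is flagged.
    reproducible_species = {sp for sp, ok in accessions if ok}

    coverage = {}
    for hotspot, species in checklist.items():
        if not species:
            coverage[hotspot] = 0.0  # avoid dividing by zero on an empty checklist
            continue
        covered = species & reproducible_species
        coverage[hotspot] = 100.0 * len(covered) / len(species)
    return coverage


# Placeholder data only; these are not results from the study.
toy_checklist = {
    "hotspot_A": {"sp1", "sp2", "sp3", "sp4"},
    "hotspot_B": {"sp1", "sp5"},
}
toy_accessions = [("sp1", True), ("sp2", False), ("sp2", True), ("sp5", False)]
toy_coverage = reproducible_coverage(toy_checklist, toy_accessions)
```

Counting species rather than accessions matters here: a species with many non-reproducible accessions and a single reproducible one still counts as covered, which matches the "at least one reproducible DNA sequence" criterion.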
Overall, the analyses indicated that the reproducibility crisis also extended to the realm of
biodiversity. There was a significant shortfall in genetic information available for biodiversity
hotspots: 80.3% of the sequences produced (591,431) were not reproducible. This meant that
only 19.7% of sequences, representing only 37,637 species (18% of the total), were reproducible.
This phenomenon was named the Wu-Meyersian shortfall, in honor of Ray Wu (the father of DNA
sequencing; 1928-2008) and Norman Meyers (a pioneer in establishing biodiversity hotspots;
1934-2019), to recognize that we are critically lacking DNA sequence data for threatened
biodiversity. Addressing this shortfall could help alleviate the Linnaean,
Wallacean and Darwinian shortfalls and support conservation. Information was particularly lacking
in tropical biodiversity hotspots, but no biodiversity hotspot other than Japan had >50% of its
flora reproducibly sequenced. Older biodiversity hotspots were less well known than those established
more recently. This is concerning because the older hotspots are among the most diverse and threatened
(e.g. Madagascar, Sundaland). From a DNA region perspective, ITS (23,422 species), matK (17,164
species), and rbcL (16,509 species) were the most commonly used barcodes. From a lineage perspective,
gymnosperms (N=895) were exceptionally well sequenced, with three quarters of their species
reproducibly sequenced. Angiosperms were comparatively poorly sequenced (18%), but this may be explained
by their extreme diversity (N=195,433). Ferns and their allies (N=7,716) were also poorly sequenced
(22%). This is especially troubling because the extinction of these species would represent the loss of
hundreds of millions of years of unique evolutionary history. Finally, this study proposed best practices
to maximize the reproducibility of DNA sequences produced by the scientific community.
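
As a quick consistency check, the headline percentages can be re-derived from the counts reported above. The short Python sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
total_accessions = 736_532
non_reproducible_accessions = 591_431
total_species = 204_044
reproducibly_sequenced_species = 37_637
lineage_sizes = {
    "gymnosperms": 895,
    "angiosperms": 195_433,
    "ferns and allies": 7_716,
}

# Share of accessions that were not reproducible, and its complement.
pct_non_reproducible = 100 * non_reproducible_accessions / total_accessions
pct_reproducible = 100 - pct_non_reproducible

# Share of checklist species with at least one reproducible sequence.
pct_species_covered = 100 * reproducibly_sequenced_species / total_species

print(f"{pct_non_reproducible:.1f}% of accessions not reproducible")  # 80.3%
print(f"{pct_reproducible:.1f}% of accessions reproducible")          # 19.7%
print(f"{pct_species_covered:.0f}% of species covered")               # 18%

# The three lineage counts account for the full species checklist.
lineage_total = sum(lineage_sizes.values())
print(lineage_total == total_species)  # True
```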
The bioinformatic pipeline can be applied at multiple geographical scales and to any
taxonomic group, and is therefore relevant to a wide range of stakeholders. We recommend using
it periodically to monitor progress towards alleviating the Wu-Meyersian shortfall.