Algorithms for connecting scientific names with literature in the Biodiversity Heritage Library via the Global Names Project and Catalogue of Life

Author(s):  
Geoffrey Ower ◽  
Dmitry Mozzherin

Being able to quickly find and access original species descriptions is essential for conducting taxonomic research efficiently. Linking scientific name queries to the original species description is challenging and requires taxonomic intelligence: on average, an estimated three scientific names are associated with each currently accepted species, and many historical scientific names have fallen into disuse after being synonymized or forgotten. Additionally, non-standard usage of journal abbreviations can make it difficult to automatically disambiguate bibliographic citations and ascribe them to the correct publication. The largest open-access resource for biodiversity literature is the Biodiversity Heritage Library (BHL), built by a consortium of natural history institutions, which contains over 200,000 digitized volumes of natural history publications spanning hundreds of years of biological research. The Catalogue of Life (CoL) is the largest global aggregator of scientific names, publishing an annual checklist of currently accepted scientific names and their historical synonyms. TaxonWorks is an integrative web-based workbench that facilitates collaboration on biodiversity informatics research between scientists and developers. The Global Names project has been collaborating with BHL, TaxonWorks, and CoL to develop a Global Names Index that links these services together by finding scientific names in BHL and using the taxonomic intelligence provided by CoL to link directly to the referenced page in BHL. The Global Names Index is continuously updated as metadata improves and digitization technologies advance to provide more accurate optical character recognition (OCR) of scanned texts. We developed an open-source tool, “BHLnames,” and launched a RESTful application programming interface (API) service with a freely available JavaScript widget that can be embedded on any website to link scientific names to literature citations in BHL. If no bibliographic citation is provided, the widget links to the oldest name usage in BHL, which is often the original species description. The BHLnames widget can also be used to browse all mentions of a scientific name and its synonyms in BHL, making the tool more broadly useful for studying the natural history of any species.
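For illustration, a minimal Python sketch of the kind of lookup the BHLnames API enables: given a scientific name and no citation, fetch candidate BHL references and pick the oldest usage. The base URL, endpoint path, and response fields here are assumptions for illustration, not the documented interface; consult the BHLnames project for the actual API.

```python
import requests

BASE_URL = "https://bhlnames.globalnames.org/api/v1"  # assumed base URL

def oldest_usage(name: str) -> dict:
    """Return the earliest BHL reference found for a scientific name."""
    resp = requests.get(f"{BASE_URL}/name_refs/{name}", timeout=30)
    resp.raise_for_status()
    refs = resp.json().get("references", [])  # assumed response field
    # With no citation supplied, take the oldest usage, which is often
    # the original species description.
    return min(refs, key=lambda r: r.get("year", 9999)) if refs else {}

print(oldest_usage("Achatina fulica"))
```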

Author(s):  
Donald Sturgeon

This article presents technical approaches and innovations in digital library design developed during the design and implementation of the Chinese Text Project, a widely used, large-scale full-text digital library of premodern Chinese writing. By leveraging a combination of domain-optimized Optical Character Recognition, a purpose-designed crowdsourcing system, and an Application Programming Interface (API), this project simultaneously provides a sustainable transcription system, search interface, and reading environment, as well as an extensible platform for transcribing and working with premodern Chinese textual materials. By means of the API, intentionally loosely integrated text mining tools are used to extend the platform while also being reusable independently with materials from other sources and in other languages.
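As a concrete illustration of the loosely coupled API approach, the sketch below fetches a passage from the Chinese Text Project by URN. The endpoint follows the project's publicly documented API (api.ctext.org), but the exact response fields should be treated as assumptions and checked against the current documentation.

```python
import requests

def get_passage(urn: str) -> list[str]:
    """Fetch the full text of a ctext.org URN as a list of paragraphs."""
    resp = requests.get(
        "https://api.ctext.org/gettext", params={"urn": urn}, timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("fulltext", [])  # assumed response field

# Example: the opening chapter of the Analects.
for paragraph in get_passage("ctp:analects/xue-er"):
    print(paragraph)
```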


Author(s):  
Abraham Nieva de la Hidalga ◽  
David Owen ◽  
Irena Spasić ◽ 
Paul Rosin ◽  
Xianfang Sun

The need to increase global accessibility to specimens while preserving the physical specimens by reducing their handling motivates digitisation. Digitisation of natural history collections has evolved from recording specimens' catalogue data to including digital images and 3D models of specimens. The sheer size of the collections requires developing high-throughput digitisation workflows, as well as novel acquisition systems, image standardisation, curation, preservation, and publishing. For instance, herbarium sheet digitisation workflows (and fast digitisation stations) can digitise up to 6,000 specimens per day, and operating digitisation stations in parallel can increase that capacity. However, other activities of digitisation workflows still rely on manual processes, which throttle the speed with which images can be published. Image quality control and information extraction from images can benefit from greater automation. This presentation explores the advantages of applying semantic segmentation (Fig. 1) to improve and automate image quality management (IQM) and information extraction from images (IEFI) of physical specimens. Two experiments were designed to determine whether IQM and IEFI activities can be improved by using segments instead of full images. The time for segmenting full images needs to be considered for both IQM and IEFI: a semantic segmentation method developed by the Natural History Museum (Durrant and Livermore 2018), adapted for segmenting herbarium sheet images (Dillen et al. 2019), can process 50 images in 12 minutes. The IQM experiments evaluated the application of three quality attributes to full images and to image segments: colourfulness (Fig. 2), contrast (Fig. 3) and sharpness (Fig. 4). Evaluating colourfulness is an alternative to colour-difference measures such as RMSE and Delta E (Hasler and Suesstrunk 2003, Palus 2006); the method produces a value indicating whether the image degrades after processing. Contrast measures the difference in luminance or colour that makes an object distinguishable and is determined by the difference in colour and brightness between the object and other objects within the same field of view (Matkovic et al. 2005, Präkel 2010). Sharpness encompasses the concepts of resolution and acutance (Bahrami and Kot 2014, Präkel 2010) and influences specimen appearance and the readability of information on labels and barcodes. Evaluating the criteria on 56 barcode and 50 colour chart segments extracted from 50 images took 34 minutes (8 minutes for the barcodes and 26 minutes for the colour charts); the evaluation on the corresponding full images took 100 minutes. The processing of individual segments and full images provided results equivalent to subjective manual quality management. The IEFI experiments compared the performance of four optical character recognition (OCR) programs applied to full images (Drinkwater et al. 2014) against individual segments. The four OCR programs evaluated were Tesseract 4.X, Tesseract 3.X, ABBYY FineReader Engine 12, and Microsoft OneNote 2013. The test was based on a set of 250 herbarium sheet images and 1,837 segments extracted from them. The results show an average OCR speed-up of 49% when using segmented images compared to processing times for full images (Table 1). Similarly, there was an average increase of 13% in line correctness, i.e., information from lines being ordered rather than fragmented (Fig. 5, Table 2). Additionally, the results are useful for comparing the four OCR programs, with Tesseract 3.X offering the shortest processing time and Tesseract 4.X achieving the highest scores for line accuracy (including handwritten text recognition). The results suggest that IEFI could be improved by performing OCR on segments rather than whole images, leading to faster processing and more accurate outputs. The findings support the feasibility of further automation of digitisation workflows for natural history collections. In addition to increasing the accuracy and speed of IQM and IEFI activities, the explored approaches can be packaged and published, enabling automated quality management and information extraction to be offered as a service, taking advantage of cloud platforms and workflow engines.
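For concreteness, the colourfulness attribute (after Hasler and Suesstrunk 2003, cited above) can be computed in a few lines. This is a minimal reimplementation sketch, not the authors' code; applying it to a small segment such as a colour chart rather than the full sheet is what yields the reported speed-up.

```python
import numpy as np
from PIL import Image

def colourfulness(path: str) -> float:
    """Hasler-Suesstrunk colourfulness of an RGB image (higher = more colourful)."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                # red-green opponent channel
    yb = 0.5 * (r + g) - b    # yellow-blue opponent channel
    std = np.hypot(rg.std(), yb.std())
    mean = np.hypot(rg.mean(), yb.mean())
    return std + 0.3 * mean

# Comparing the score before and after processing indicates whether an
# image (or a segment such as a colour chart) has degraded.
print(colourfulness("herbarium_sheet.jpg"))  # placeholder file name
```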


Author(s):  
Constance Rinaldo

This will be a short introduction to the symposium "Improving access to hidden scientific data in the Biodiversity Heritage Library." The symposium will present examples of how the Biodiversity Heritage Library (BHL) collaborates across the international consortium and with community partners around the world to enhance access to the biodiversity literature. Literature repositories, particularly the BHL collections, have been recognized as critical to the global scientific community. A diverse global user community propels BHL and its users to develop access tools beyond the standard “title, author, subject” search. BHL utilizes the Global Names Recognition and Discovery (GNRD) service to identify taxonomic names within text rendered by Optical Character Recognition (OCR) software, enabling scientific name searches and the creation of species-specific bibliographies, which are critical to systematics research. In this symposium, we will hear from international partners and creative users making data from the BHL globally accessible for the kinds of larger-scale analysis enabled by BHL’s full-text search capabilities and Application Programming Interface (API) protocols. In addition to the taxonomic name services already incorporated in BHL, the consortium has also begun exploring georeferencing strategies for better searching and potential connections with key biodiversity resources such as the Global Biodiversity Information Facility (GBIF). With many different institutions around the world participating, the ability to work together virtually is critical for a seamless end product that meets the demands of the international community as well as the needs of local institutions.
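As a sketch of the name-finding step described above, the snippet below submits OCR text to a GNRD-style Global Names service and collects the detected scientific names. The endpoint URL and response fields are assumptions for illustration; consult the Global Names documentation for the current interface.

```python
import requests

def find_names(ocr_text: str) -> list[str]:
    """Return scientific names detected in a block of OCR text."""
    resp = requests.post(
        "https://finder.globalnames.org/api/v1/find",  # assumed endpoint
        json={"text": ocr_text},
        timeout=30,
    )
    resp.raise_for_status()
    return [n["name"] for n in resp.json().get("names", [])]  # assumed fields

print(find_names("The type specimen of Papilio machaon L. was examined."))
```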


Author(s):  
Mark Hereld ◽  
Nicola Ferrier

Digital technology presents us with new and compelling opportunities for discovery when focused on the world's natural history collections. The outstanding barrier to applying existing and forthcoming computational methods to large-scale study of this important resource is that it is largely not yet in the digital realm. Without new and much faster methods for digitizing objects in these collections, it will be a long time before these data are available in digital form. For example, the methods currently employed for capturing, cataloguing, and indexing pinned insect specimen data would require many tens of years to process collections with millions of dry specimens, so a much faster pipeline is needed. In this paper we describe a capture system capable of collecting and archiving the imagery necessary to digitize a collection of circa 4.5 million specimens in one or two years of production operation. To minimize the time required to digitize each specimen, we have proposed (Hereld et al. 2017) developing multi-camera systems that capture the pinned insect and its accompanying labels from many angles in a single exposure. Using a sample (21 randomly drawn drawers, totalling 5,178 insects) of the 4.5 million specimens in the collection at the Field Museum of Natural History, we estimated that a large fraction of the collection (97.6% ± 2.2%) consists of pinned insects whose labels are visible from one angle or another without requiring adjustment or removal of elements on the pin. In this situation, a multi-camera system with enough angular coverage could provide imagery for reconstructing virtual labels from fragmentary views taken from different directions. Agarwal et al. (2018) demonstrated a method for combining these multiple views into a virtual label that could be transcribed by automated optical character recognition software. We have now designed, built and tested a prototype snapshot 3D digitization station for rapid capture of multi-view imagery of pinned insect specimens and labels. It consists of twelve very small and light 8-megapixel cameras (Fig. 1), each controlled by a small dedicated computer. The cameras are arrayed around the target volume, six on each side of the sample feed path, with their positions and orientations fixed by a 3D-printed scaffolding designed for the purpose. The twelve camera controllers and a master computer are connected to a dedicated high-speed data network over which all coordinating control signals, returning images, and metadata are passed. The system is integrated with a high-performance object store that includes a database for the metadata and archived images comprising each snapshot. The system is designed so that it can be readily extended with additional or different sensors. The station is meant to be fed specimens by a conveyor belt whose motion is coordinated with the exposure of the multi-view snapshots. To test the performance of the system, we added a recirculating specimen feeder designed expressly for this experiment. With it integrated into the system in place of a conventional conveyor belt, we were able to provide a continuous stream of targets for the digitization system, facilitating long tests of its performance and robustness. We demonstrated the ability to capture data at a peak rate of 1,400 specimens per hour and an average rate of 1,000 specimens per hour over a sustained 6-hour run.
The dataset (Hereld and Ferrier 2018) collected in this experiment provides fodder for the further development of algorithms for the offline reconstruction and automatic transcription of the label contents.
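The coordination pattern described above (a master computer triggering twelve camera controllers and archiving the returned views as one snapshot) might be sketched as follows. The controller hostnames, the /snap endpoint, and the storage layout are invented for illustration; the authors' control software is not described at this level of detail.

```python
import concurrent.futures
import pathlib
import time

import requests

CONTROLLERS = [f"http://camctl{i:02d}.local:8080" for i in range(12)]  # assumed hosts

def trigger(controller: str, snapshot_id: str) -> bytes:
    """Ask one camera controller to expose and return its JPEG."""
    resp = requests.get(f"{controller}/snap", params={"id": snapshot_id}, timeout=10)
    resp.raise_for_status()
    return resp.content

def capture_snapshot(archive: pathlib.Path) -> None:
    """Trigger all cameras at once and archive the views as one snapshot."""
    snapshot_id = f"snap-{time.time_ns()}"
    outdir = archive / snapshot_id
    outdir.mkdir(parents=True)
    # Fire the twelve cameras concurrently so the views are near-simultaneous.
    with concurrent.futures.ThreadPoolExecutor(max_workers=12) as pool:
        jobs = {pool.submit(trigger, c, snapshot_id): i
                for i, c in enumerate(CONTROLLERS)}
        for job in concurrent.futures.as_completed(jobs):
            cam = jobs[job]
            (outdir / f"cam{cam:02d}.jpg").write_bytes(job.result())

capture_snapshot(pathlib.Path("archive"))
```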


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition rather than being retyped. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists covering most nineteenth-century farms in Norway constitute one example among a series of valuable sources which can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in the recognition of material with a complex table structure, outlining a new algorithmic model based on ‘linked hierarchies’. Within the scope of this model, a variety of tables and layouts can be described and recognized. The ‘linked hierarchies’ model has been implemented in the ‘CRIPT’ OCR software system, which successfully reads tables with a complex structure from several different historical sources.
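The 'linked hierarchies' model is only outlined above; the sketch below is a speculative rendering of one way such a description might look in code, with a geometric layout hierarchy linked to a logical record hierarchy so the recognizer knows which field a recognized cell populates.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                        # e.g. "page", "table", "row", "cell"
    bbox: tuple[int, int, int, int]  # x0, y0, x1, y1 in page pixels
    children: list["Node"] = field(default_factory=list)
    links: list["Node"] = field(default_factory=list)  # cross-hierarchy links

# Layout hierarchy: what the scanner sees on the page.
cell = Node("cell", (120, 300, 220, 330))
row = Node("row", (40, 300, 900, 330), children=[cell])
table = Node("table", (40, 120, 900, 1200), children=[row])

# Logical hierarchy: what the historian wants (field names are invented).
farm_value = Node("assessed_value", (0, 0, 0, 0))
record = Node("farm_record", (0, 0, 0, 0), children=[farm_value])

# The link tells the recognizer which field a recognized cell populates.
cell.links.append(farm_value)
```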


2020 ◽  
Vol 2020 (1) ◽  
pp. 78-81
Author(s):  
Simone Zini ◽  
Simone Bianco ◽  
Raimondo Schettini

Rain removal from pictures taken under bad weather conditions is a challenging task that aims to improve the overall quality and visibility of a scene. The enhanced images usually constitute the input for subsequent Computer Vision tasks such as detection and classification. In this paper, we present a Convolutional Neural Network, based on the Pix2Pix model, for rain streak removal from images, with specific interest in evaluating the results of the processing with respect to the Optical Character Recognition (OCR) task. In particular, we present a way to generate a rainy version of the Street View Text Dataset (R-SVTD) for evaluating text detection and recognition in bad weather conditions. Experimental results on this dataset show that our model outperforms the state of the art in terms of two commonly used image quality metrics, and that it is capable of improving the performance of an OCR model in detecting and recognising text in the wild.
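The OCR-oriented evaluation can be sketched as follows: run the same OCR engine on the rainy and derained versions of an image and compare recovered words against ground truth. Tesseract stands in here for the OCR model; the deraining network itself is not reproduced, and the file names are placeholders.

```python
import pytesseract
from PIL import Image

def ocr_words(path: str) -> set[str]:
    """Words Tesseract recognizes in an image."""
    return set(pytesseract.image_to_string(Image.open(path)).split())

def word_recall(path: str, truth: set[str]) -> float:
    """Fraction of ground-truth words the OCR engine recovers."""
    found = ocr_words(path)
    return len(found & truth) / len(truth) if truth else 0.0

# Placeholder files: a rainy image and its derained counterpart.
truth = {"MOTEL", "VACANCY"}
print("rainy:   ", word_recall("sign_rainy.png", truth))
print("derained:", word_recall("sign_derained.png", truth))
```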


2014 ◽  
Vol 6 (1) ◽  
pp. 36-39
Author(s):  
Kevin Purwito

This paper describes one of the many extensions of Optical Character Recognition (OCR): Optical Music Recognition (OMR). OMR is used to recognize musical sheets and convert them into digital formats such as MIDI or MusicXML. Many musical symbols commonly used on musical sheets must therefore be recognized by OMR, such as the staff; treble, bass, alto and tenor clefs; sharps, flats and naturals; beams, staccato, staccatissimo, dynamics, tenuto, marcato, stopped notes, harmonics and fermatas; notes; rests; ties and slurs; and mordents and turns. OMR usually has four main processes: Preprocessing, Music Symbol Recognition, Musical Notation Reconstruction and Final Representation Construction. Each of these four processes uses different methods and algorithms, and each still needs further development and research. Many applications to date use OMR, but none gives perfect results. Therefore, besides development and research for each OMR process, there is also a need for development and research on combined recognizers, which combine the results from different OMR applications to increase the accuracy of the final result. Index Terms: Music, optical character recognition, optical music recognition, musical symbol, image processing, combined recognizer
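The combined recognizer suggested in the closing sentences can be sketched as symbol-level majority voting across the outputs of several OMR programs. The aligned-output format below is an assumption; a real system would first need to align staves and symbol positions across programs.

```python
from collections import Counter

def combine(outputs: list[list[str]]) -> list[str]:
    """Majority vote per symbol position across aligned OMR outputs."""
    combined = []
    for symbols in zip(*outputs):
        vote, _ = Counter(symbols).most_common(1)[0]
        combined.append(vote)
    return combined

# Three hypothetical OMR readings of the same four-symbol measure.
readings = [
    ["treble_clef", "sharp", "quarter_C5", "rest"],
    ["treble_clef", "sharp", "quarter_C5", "rest"],
    ["treble_clef", "natural", "eighth_C5", "rest"],
]
print(combine(readings))  # ['treble_clef', 'sharp', 'quarter_C5', 'rest']
```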

