Our journey to digital curation of the Jeghers Medical Index

2017 ◽  
Vol 105 (3) ◽  
Author(s):  
Lori Gawdyda ◽  
Kimbroe Carter ◽  
Mark Willson ◽  
Denise Bedford

Background: Harold Jeghers, a well-known medical educator of the twentieth century, maintained a print collection of about one million medical articles from the late 1800s to the 1990s. This case study discusses how a print collection of these articles was transformed to a digital database. Case Presentation: Staff in the Jeghers Medical Index, St. Elizabeth Youngstown Hospital, converted paper articles to Adobe portable document format (PDF)/A-1a files. Optical character recognition was used to obtain searchable text. The data were then incorporated into a specialized database. Lastly, articles were matched to PubMed bibliographic metadata through automation and human review. An online database of the collection was ultimately created. The collection was made part of a discovery search service, and semantic technologies have been explored as a method of creating access points. Conclusions: This case study shows how a small medical library made medical writings of the nineteenth and twentieth centuries available in electronic format for historic or semantic research, highlighting the efficiencies of contemporary information technology.
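
The matching of OCR-derived text to PubMed bibliographic records can be pictured with a small sketch. The snippet below is an assumption for illustration, not the pipeline actually used at the Jeghers Medical Index: it queries the public NCBI E-utilities esearch endpoint with a title string recovered from an OCRed PDF and returns candidate PMIDs for human review.

# Hypothetical sketch: match an OCR-extracted title to PubMed via NCBI E-utilities.
# Illustrative only; not the matching workflow described in the case study.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def candidate_pmids(ocr_title: str, max_hits: int = 5) -> list[str]:
    """Return PubMed IDs whose title field matches the OCR-extracted title."""
    params = {
        "db": "pubmed",
        "term": f"{ocr_title}[Title]",
        "retmax": max_hits,
        "retmode": "json",
    }
    reply = requests.get(EUTILS, params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["esearchresult"]["idlist"]

# Example: a title recovered by OCR, possibly containing minor recognition errors.
print(candidate_pmids("Peutz-Jeghers syndrome"))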

2011 ◽  
Vol 128-129 ◽  
pp. 1303-1307
Author(s):  
Yu Mei Wu ◽  
Zhi Fang Liu

Many efforts have been made to achieve automated Graphical User Interface (GUI) testing. The most popular approach is model-based testing, which supports automated test case generation and execution. Building such a model is a non-trivial task, however, and usually accounts for most of the workload in the entire testing process. Most approaches to automated model derivation depend on the programming language or a specific OS. In this paper, we propose a new approach to GUI modeling using Optical Character Recognition (OCR) and analyze in detail the technical problems encountered. A case study shows that our approach is capable of analyzing most GUI windows and generating the corresponding model, thereby eliminating the above constraint.
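
The core idea, using OCR on a window screenshot to locate labelled widgets independently of programming language or OS, can be sketched as follows. The sketch assumes the Tesseract engine with the pytesseract and Pillow packages; the paper's own tooling is not specified, so this is only an illustration of OCR-driven GUI model extraction.

# Sketch (assumption): derive GUI widget labels and positions from a screenshot with OCR.
import pytesseract
from PIL import Image

def detect_labels(screenshot_path: str, min_conf: float = 60.0):
    """Return (text, bounding box) pairs for words Tesseract recognises in the screenshot."""
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    widgets = []
    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) >= min_conf:
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            widgets.append((word, box))
    return widgets

# Each (label, box) pair can become a node in a GUI model used for test case generation.
print(detect_labels("login_window.png"))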


2019 ◽  
Vol 34 (4) ◽  
pp. 825-843 ◽  
Author(s):  
Mark J Hill ◽  
Simon Hengchen

Abstract This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
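
One of the analyses listed, collocation analysis, can be compared across an OCR corpus and its keyed-in counterpart with a short sketch like the one below (my own illustration using NLTK, not the authors' code): the overlap between the top-ranked bigrams of the two versions gives a rough sense of how much OCR noise distorts the measure.

# Sketch (not the authors' code): compare top bigram collocations in an OCR text
# against its keyed-in (ground-truth) counterpart.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def top_bigrams(tokens, n=100):
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(3)                      # ignore very rare pairs
    return set(finder.nbest(BigramAssocMeasures().pmi, n))

def collocation_overlap(ocr_tokens, keyed_tokens, n=100):
    """Fraction of the top-n keyed-in bigrams also found among the top-n OCR bigrams."""
    ocr, keyed = top_bigrams(ocr_tokens, n), top_bigrams(keyed_tokens, n)
    return len(ocr & keyed) / len(keyed) if keyed else 0.0

# Usage with pre-tokenised, lower-cased word lists for the two corpus versions:
# print(collocation_overlap(ocr_words, tcp_words))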


2013 ◽  
Vol 61 (3) ◽  
pp. 17-22 ◽  
Author(s):  
Bhagirath Kumar ◽  
Niraj Kumar ◽  
Charulata Palai ◽  
Pradeep Kumar Jena ◽  
Subhagata Chattopadhyay

Author(s):  
Marcus Guidoti ◽  
Carolina Sokolowicz ◽  
Felipe Simoes ◽  
Valdenar Gonçalves ◽  
Tatiana Ruschel ◽  
...  

Plazi's TreatmentBank is a research infrastructure and partner of the recent European Union-funded Biodiversity Community Integrated Knowledge Library (BiCIKL) project to provide a single knowledge portal to open, interlinked and machine-readable, findable, accessible, interoperable and reusable (FAIR) data. Plazi is liberating published biodiversity data that is trapped in so-called flat formats, such as portable document format (PDF), to increase its FAIRness. This can pose a variety of challenges for both data mining and curation of the extracted data. The automation of such a complex process requires internal organization and a well-established workflow of specific steps (e.g., decoding of the PDF, extraction of data) to handle the challenges that the immense variety of graphic layouts in the biodiversity publishing landscape can impose. These challenges may vary according to the origin of the document: scanned documents that were not initially digital need optical character recognition in order to be processed. Processing a document can be either an individual, one-time-only process or a batch process, in which a template for a specific document type must be produced. Templates consist of a set of parameters that tell Plazi-dedicated software how to read and where to find key pieces of information for the extraction process, such as the related metadata. These parameters aim to improve the outcome of the data extraction process and lead to more consistent results than manual extraction. In order to produce such templates, a set of tests and accompanying statistics are evaluated, and these same statistics are continually checked against ongoing processing tasks to assess template performance in a continuous manner. In addition to these steps intrinsically associated with the automated process, different granularity levels (e.g., a low granularity level might consist of a treatment and its subsections, versus a high granularity level that includes material citations down to named entities such as collection codes, collector, and collecting date) were defined to accommodate the specific needs of particular projects and user requirements. The higher the granularity level, the more thoroughly checked the resulting data is expected to be. Additionally, steps related to quality control (QC), such as “pre-QC”, “QC” and “extended QC”, were designed and implemented to ensure data quality and enhanced data accuracy. Data on all these stages of the processing workflow are constantly collected and assessed in order to improve those very stages, aiming for a more reliable and efficient operation. This is also associated with a current Data Architecture plan to move this data assessment to a cloud provider, enabling real-time assessment and constant analysis of template performance and of the processing stages as a whole. In this talk, the steps of this entire process are explained in detail, highlighting how data are being used to improve these steps towards a more efficient, accurate, and less costly operation.
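
The notion of a document-type template, a set of parameters telling the extraction software where to find key pieces of information, together with a pre-QC gate can be pictured with the hypothetical sketch below. The parameter names and the pre-QC rules are invented for illustration; they do not reproduce Plazi's actual template format or software.

# Hypothetical illustration of a document-type template and a simple pre-QC check.
# Parameter and field names are invented; they do not mirror Plazi's real templates.
from dataclasses import dataclass

@dataclass
class JournalTemplate:
    journal: str
    metadata_region: tuple          # page region holding article metadata (x, y, w, h)
    treatment_heading_regex: str    # pattern that marks the start of a treatment
    needs_ocr: bool                 # True for scanned, non-born-digital documents

def pre_qc(extracted: dict) -> list[str]:
    """Return a list of problems that must be resolved before full QC."""
    problems = []
    if not extracted.get("treatments"):
        problems.append("no taxonomic treatments were extracted")
    if not extracted.get("doi") and not extracted.get("title"):
        problems.append("document metadata (DOI/title) is missing")
    return problems

template = JournalTemplate("Example Journal", (0, 0, 600, 120), r"^[A-Z][a-z]+ [a-z]+", needs_ocr=False)
print(pre_qc({"treatments": [], "title": "An example article"}))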


2019 ◽  
Vol 4 (2) ◽  
pp. 335
Author(s):  
Anis Nadiah Che Abdul Rahman ◽  
Imran Ho Abdullah ◽  
Intan Safinaz Zainuddin ◽  
Azhar Jaludin

Optical Character Recognition (OCR) is a computational technology that enables the recognition of printed characters using photoelectric devices and computer software. It works by converting previously scanned images or texts into machine-readable and editable text. A variety of OCR tools are on the market for commercial and research use, available either for free or for purchase. The accuracy of an OCR tool's results also depends on its pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for developing the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports comprising 62 pages in total to examine the conversion accuracy and error rate of each tool. In this study, all of the tools were used to convert Adobe Portable Document Format (PDF) files into plain text (txt) files. The objective of this study is to give an overview, based on accuracy and error rate, of how each OCR tool essentially works and how it can be utilized to assist corpus building. The study indicates that the tools differ in accuracy and error rate when converting whole documents from PDF into plain text files. The study proposes that corpus building can be made easier and more manageable when a researcher understands how an OCR tool works and chooses the best OCR tool prior to the outset of corpus development.
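
Accuracy and error-rate comparisons of this kind are typically based on an edit-distance measure such as the character error rate (CER); the abstract does not give its exact formula, so the sketch below is only one common way to compute it.

# Sketch: character error rate (CER) of an OCR output against a reference transcription.
# CER = (substitutions + deletions + insertions) / reference length, via edit distance.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n] / m if m else 0.0

print(cer("Dewan Rakyat", "Dewan Rakyaf"))  # one substitution over 12 characters ≈ 0.083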


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition, rather than being typed once more. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists that assess most nineteenth-century farms in Norway constitute one example among a series of valuable sources which can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in the recognition of material with a complex table structure, outlining a new algorithmic model based on ‘linked hierarchies’. Within the scope of this model, a variety of tables and layouts can be described and recognized. The ‘linked hierarchies’ model has been implemented in the ‘CRIPT’ OCR software system, which successfully reads tables with a complex structure from several different historical sources.
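
As described here, a ‘linked hierarchies’ model pairs a hierarchy of layout regions with a hierarchy of logical record fields and links nodes across the two. The sketch below is a speculative, simplified rendering of that idea; it is not taken from the CRIPT system itself, and the node and field names are invented.

# Speculative sketch of a 'linked hierarchies' structure: a layout hierarchy
# (page -> table -> row -> cell) linked to a logical hierarchy (list -> record -> field).
# Not based on the CRIPT implementation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LayoutNode:
    kind: str                               # "page", "table", "row", "cell"
    children: list = field(default_factory=list)
    text: str = ""

@dataclass
class LogicalNode:
    name: str                               # e.g. "farm_record", "tax_assessment"
    children: list = field(default_factory=list)
    source: Optional[LayoutNode] = None     # the link between the two hierarchies

# A recognised table cell is linked to the logical field it populates.
cell = LayoutNode("cell", text="12 skilling")
tax_field = LogicalNode("tax_assessment", source=cell)
print(tax_field.name, "<-", tax_field.source.text)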


2020 ◽  
Vol 2020 (1) ◽  
pp. 78-81
Author(s):  
Simone Zini ◽  
Simone Bianco ◽  
Raimondo Schettini

Rain removal from pictures taken under bad weather conditions is a challenging task that aims to improve the overall quality and visibility of a scene. The enhanced images usually constitute the input for subsequent Computer Vision tasks such as detection and classification. In this paper, we present a Convolutional Neural Network, based on the Pix2Pix model, for rain streak removal from images, with specific interest in evaluating the results of the processing operation with respect to the Optical Character Recognition (OCR) task. In particular, we present a way to generate a rainy version of the Street View Text Dataset (R-SVTD) for "text detection and recognition" evaluation in bad weather conditions. Experimental results on this dataset show that our model is able to outperform the state of the art in terms of two commonly used image quality metrics, and that it is capable of improving the performance of an OCR model in detecting and recognising text in the wild.
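
The abstract does not name the two image quality metrics; PSNR and SSIM are the usual choices, so the sketch below (an assumption about the evaluation, not the authors' code) shows how a derained image could be scored against its clean ground truth with scikit-image.

# Sketch (assumption): score a derained image against its clean ground truth with
# PSNR and SSIM, two commonly used image quality metrics. Not the authors' pipeline;
# the file names are placeholders.
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

clean = io.imread("clean_sign.png")        # ground-truth rain-free image
derained = io.imread("derained_sign.png")  # output of the rain-removal network

psnr = peak_signal_noise_ratio(clean, derained)
ssim = structural_similarity(clean, derained, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")
# Downstream, the same image pairs would be fed to an OCR model to compare
# text detection and recognition rates before and after deraining.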

