Our journey to digital curation of the Jeghers Medical Index

2017 ◽  
Vol 105 (3) ◽  
Author(s):  
Lori Gawdyda ◽  
Kimbroe Carter ◽  
Mark Willson ◽  
Denise Bedford

Background: Harold Jeghers, a well-known medical educator of the twentieth century, maintained a print collection of about one million medical articles from the late 1800s to the 1990s. This case study discusses how a print collection of these articles was transformed to a digital database. Case Presentation: Staff in the Jeghers Medical Index, St. Elizabeth Youngstown Hospital, converted paper articles to Adobe portable document format (PDF)/A-1a files. Optical character recognition was used to obtain searchable text. The data were then incorporated into a specialized database. Lastly, articles were matched to PubMed bibliographic metadata through automation and human review. An online database of the collection was ultimately created. The collection was made part of a discovery search service, and semantic technologies have been explored as a method of creating access points. Conclusions: This case study shows how a small medical library made medical writings of the nineteenth and twentieth centuries available in electronic format for historic or semantic research, highlighting the efficiencies of contemporary information technology.
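
The matching of OCR-derived text to PubMed bibliographic records can be pictured with a small sketch. The snippet below is an assumption for illustration, not the pipeline actually used at the Jeghers Medical Index: it queries the public NCBI E-utilities esearch endpoint with a title string recovered from an OCRed PDF and returns candidate PMIDs for human review.

# Hypothetical sketch: match an OCR-extracted title to PubMed via NCBI E-utilities.
# Illustrative only; not the matching workflow described in the case study.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def candidate_pmids(ocr_title: str, max_hits: int = 5) -> list[str]:
    """Return PubMed IDs whose title field matches the OCR-extracted title."""
    params = {
        "db": "pubmed",
        "term": f"{ocr_title}[Title]",
        "retmax": max_hits,
        "retmode": "json",
    }
    reply = requests.get(EUTILS, params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["esearchresult"]["idlist"]

# Example: a title recovered by OCR, possibly containing minor recognition errors.
print(candidate_pmids("Peutz-Jeghers syndrome"))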

2011 ◽  
Vol 128-129 ◽  
pp. 1303-1307
Author(s):  
Yu Mei Wu ◽  
Zhi Fang Liu

Many efforts have been made to achieve automated Graphical User Interface (GUI) testing. The most popular approach is model-based testing, which supports automated test case generation and execution. Building such a model is a non-trivial task, however, and usually accounts for most of the workload in the entire testing process. Most approaches to automated model derivation depend on the programming language or a specific OS. In this paper, we propose a new approach to GUI modeling using Optical Character Recognition (OCR) and analyze in detail the technical problems encountered. A case study shows that our approach is capable of analyzing most GUI windows and generating the corresponding model, thereby eliminating the above constraint.
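
The core idea, using OCR on a window screenshot to locate labelled widgets independently of programming language or OS, can be sketched as follows. The sketch assumes the Tesseract engine with the pytesseract and Pillow packages; the paper's own tooling is not specified, so this is only an illustration of OCR-driven GUI model extraction.

# Sketch (assumption): derive GUI widget labels and positions from a screenshot with OCR.
import pytesseract
from PIL import Image

def detect_labels(screenshot_path: str, min_conf: float = 60.0):
    """Return (text, bounding box) pairs for words Tesseract recognises in the screenshot."""
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    widgets = []
    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) >= min_conf:
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            widgets.append((word, box))
    return widgets

# Each (label, box) pair can become a node in a GUI model used for test case generation.
print(detect_labels("login_window.png"))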


2019 ◽  
Vol 34 (4) ◽  
pp. 825-843 ◽  
Author(s):  
Mark J Hill ◽  
Simon Hengchen

Abstract This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
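
One of the analyses listed, collocation analysis, can be compared across an OCR corpus and its keyed-in counterpart with a short sketch like the one below (my own illustration using NLTK, not the authors' code): the overlap between the top-ranked bigrams of the two versions gives a rough sense of how much OCR noise distorts the measure.

# Sketch (not the authors' code): compare top bigram collocations in an OCR text
# against its keyed-in (ground-truth) counterpart.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def top_bigrams(tokens, n=100):
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(3)                      # ignore very rare pairs
    return set(finder.nbest(BigramAssocMeasures().pmi, n))

def collocation_overlap(ocr_tokens, keyed_tokens, n=100):
    """Fraction of the top-n keyed-in bigrams also found among the top-n OCR bigrams."""
    ocr, keyed = top_bigrams(ocr_tokens, n), top_bigrams(keyed_tokens, n)
    return len(ocr & keyed) / len(keyed) if keyed else 0.0

# Usage with pre-tokenised, lower-cased word lists for the two corpus versions:
# print(collocation_overlap(ocr_words, tcp_words))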


2013 ◽  
Vol 61 (3) ◽  
pp. 17-22 ◽  
Author(s):  
Bhagirath Kumar ◽  
Niraj Kumar ◽  
Charulata Palai ◽  
Pradeep Kumar Jena ◽  
Subhagata Chattopadhyay

Author(s):  
Marcus Guidoti ◽  
Carolina Sokolowicz ◽  
Felipe Simoes ◽  
Valdenar Gonçalves ◽  
Tatiana Ruschel ◽  
...  

Plazi's TreatmentBank is a research infrastructure and partner of the recent European Union-funded Biodiversity Community Integrated Knowledge Library (BiCIKL) project to provide a single knowledge portal to open, interlinked and machine-readable, findable, accessible, interoperable and reusable (FAIR) data. Plazi is liberating published biodiversity data that is trapped in so-called flat formats, such as portable document format (PDF), to increase its FAIRness. This can pose a variety of challenges for both data mining and curation of the extracted data. The automation of such a complex process requires internal organization and a well-established workflow of specific steps (e.g., decoding of the PDF, extraction of data) to handle the challenges that the immense variety of graphic layouts in the biodiversity publishing landscape can impose. These challenges may vary according to the origin of the document: scanned documents that were not initially digital need optical character recognition in order to be processed. Processing a document can be either an individual, one-time-only process or a batch process, in which a template for a specific document type must be produced. Templates consist of a set of parameters that tell Plazi-dedicated software how to read and where to find key pieces of information for the extraction process, such as the related metadata. These parameters aim to improve the outcome of the data extraction process and lead to more consistent results than manual extraction. In order to produce such templates, a set of tests and accompanying statistics are evaluated, and these same statistics are continually checked against ongoing processing tasks to assess template performance in a continuous manner. In addition to these steps intrinsically associated with the automated process, different granularity levels (e.g., a low granularity level might consist of a treatment and its subsections, versus a high granularity level that includes material citations down to named entities such as collection codes, collector, and collecting date) were defined to accommodate the specific needs of particular projects and user requirements. The higher the granularity level, the more thoroughly checked the resulting data is expected to be. Additionally, steps related to quality control (QC), such as “pre-QC”, “QC” and “extended QC”, were designed and implemented to ensure data quality and enhanced data accuracy. Data on all these stages of the processing workflow are constantly collected and assessed in order to improve those very stages, aiming for a more reliable and efficient operation. This is also associated with a current Data Architecture plan to move this data assessment to a cloud provider, enabling real-time assessment and constant analysis of template performance and of the processing stages as a whole. In this talk, the steps of this entire process are explained in detail, highlighting how data are being used to improve these steps towards a more efficient, accurate, and less costly operation.
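
The notion of a document-type template, a set of parameters telling the extraction software where to find key pieces of information, together with a pre-QC gate can be pictured with the hypothetical sketch below. The parameter names and the pre-QC rules are invented for illustration; they do not reproduce Plazi's actual template format or software.

# Hypothetical illustration of a document-type template and a simple pre-QC check.
# Parameter and field names are invented; they do not mirror Plazi's real templates.
from dataclasses import dataclass

@dataclass
class JournalTemplate:
    journal: str
    metadata_region: tuple          # page region holding article metadata (x, y, w, h)
    treatment_heading_regex: str    # pattern that marks the start of a treatment
    needs_ocr: bool                 # True for scanned, non-born-digital documents

def pre_qc(extracted: dict) -> list[str]:
    """Return a list of problems that must be resolved before full QC."""
    problems = []
    if not extracted.get("treatments"):
        problems.append("no taxonomic treatments were extracted")
    if not extracted.get("doi") and not extracted.get("title"):
        problems.append("document metadata (DOI/title) is missing")
    return problems

template = JournalTemplate("Example Journal", (0, 0, 600, 120), r"^[A-Z][a-z]+ [a-z]+", needs_ocr=False)
print(pre_qc({"treatments": [], "title": "An example article"}))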


2019 ◽  
Vol 4 (2) ◽  
pp. 335
Author(s):  
Anis Nadiah Che Abdul Rahman ◽  
Imran Ho Abdullah ◽  
Intan Safinaz Zainuddin ◽  
Azhar Jaludin

Optical Character Recognition (OCR) is a computational technology that enables the recognition of printed characters using photoelectric devices and computer software. It works by converting previously scanned images or texts into machine-readable and editable text. A variety of OCR tools are on the market for commercial and research use, available either for free or for purchase. The accuracy of an OCR tool's results also depends on its pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for developing the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports comprising 62 pages in total to examine the conversion accuracy and error rate of each tool. In this study, all of the tools were used to convert Adobe Portable Document Format (PDF) files into plain text (txt) files. The objective of this study is to give an overview, based on accuracy and error rate, of how each OCR tool essentially works and how it can be utilized to assist corpus building. The study indicates that the tools differ in accuracy and error rate when converting whole documents from PDF into plain text files. The study proposes that corpus building can be made easier and more manageable when a researcher understands how an OCR tool works and chooses the best OCR tool prior to the outset of corpus development.
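
Accuracy and error-rate comparisons of this kind are typically based on an edit-distance measure such as the character error rate (CER); the abstract does not give its exact formula, so the sketch below is only one common way to compute it.

# Sketch: character error rate (CER) of an OCR output against a reference transcription.
# CER = (substitutions + deletions + insertions) / reference length, via edit distance.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n] / m if m else 0.0

print(cer("Dewan Rakyat", "Dewan Rakyaf"))  # one substitution over 12 characters ≈ 0.083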


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition, rather than being typed once more. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists that assess most nineteenth-century farms in Norway constitute one example among a series of valuable sources which can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in the recognition of material with a complex table structure, outlining a new algorithmic model based on ‘linked hierarchies’. Within the scope of this model, a variety of tables and layouts can be described and recognized. The ‘linked hierarchies’ model has been implemented in the ‘CRIPT’ OCR software system, which successfully reads tables with a complex structure from several different historical sources.
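
As described here, a ‘linked hierarchies’ model pairs a hierarchy of layout regions with a hierarchy of logical record fields and links nodes across the two. The sketch below is a speculative, simplified rendering of that idea; it is not taken from the CRIPT system itself, and the node and field names are invented.

# Speculative sketch of a 'linked hierarchies' structure: a layout hierarchy
# (page -> table -> row -> cell) linked to a logical hierarchy (list -> record -> field).
# Not based on the CRIPT implementation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LayoutNode:
    kind: str                               # "page", "table", "row", "cell"
    children: list = field(default_factory=list)
    text: str = ""

@dataclass
class LogicalNode:
    name: str                               # e.g. "farm_record", "tax_assessment"
    children: list = field(default_factory=list)
    source: Optional[LayoutNode] = None     # the link between the two hierarchies

# A recognised table cell is linked to the logical field it populates.
cell = LayoutNode("cell", text="12 skilling")
tax_field = LogicalNode("tax_assessment", source=cell)
print(tax_field.name, "<-", tax_field.source.text)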


2020 ◽  
Vol 2020 (1) ◽  
pp. 78-81
Author(s):  
Simone Zini ◽  
Simone Bianco ◽  
Raimondo Schettini

Rain removal from pictures taken under bad weather conditions is a challenging task that aims to improve the overall quality and visibility of a scene. The enhanced images usually constitute the input for subsequent Computer Vision tasks such as detection and classification. In this paper, we present a Convolutional Neural Network, based on the Pix2Pix model, for rain streak removal from images, with specific interest in evaluating the results of the processing operation with respect to the Optical Character Recognition (OCR) task. In particular, we present a way to generate a rainy version of the Street View Text Dataset (R-SVTD) for "text detection and recognition" evaluation in bad weather conditions. Experimental results on this dataset show that our model is able to outperform the state of the art in terms of two commonly used image quality metrics, and that it is capable of improving the performance of an OCR model in detecting and recognising text in the wild.
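
The abstract does not name the two image quality metrics; PSNR and SSIM are the usual choices, so the sketch below (an assumption about the evaluation, not the authors' code) shows how a derained image could be scored against its clean ground truth with scikit-image.

# Sketch (assumption): score a derained image against its clean ground truth with
# PSNR and SSIM, two commonly used image quality metrics. Not the authors' pipeline;
# the file names are placeholders.
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

clean = io.imread("clean_sign.png")        # ground-truth rain-free image
derained = io.imread("derained_sign.png")  # output of the rain-removal network

psnr = peak_signal_noise_ratio(clean, derained)
ssim = structural_similarity(clean, derained, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")
# Downstream, the same image pairs would be fed to an OCR model to compare
# text detection and recognition rates before and after deraining.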

