OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings


2019 ◽  
Vol 9 (22) ◽  
pp. 4853 ◽  
Author(s):  
Christian Reul ◽  
Dennis Christ ◽  
Alexander Hartelt ◽  
Nico Balbach ◽  
Maximilian Wehner ◽  
...  

Optical Character Recognition (OCR) on historical printings is a challenging task, mainly due to the complexity of the layouts and the highly variable typography. Nevertheless, in the last few years, great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout analysis and segmentation, character recognition, and post-processing. A drawback of these tools is often their limited usability for non-technical users such as humanist scholars and, in particular, the difficulty of combining several tools into a workflow. In this paper, we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. While a variety of materials can already be processed fully automatically, books with more complex layouts require manual intervention by the users. This is mostly because the ground truth required for training stronger mixed models (for segmentation as well as text recognition) is not yet available in the desired quantity or quality. To deal with this issue in the short run, OCR4all offers a comfortable GUI that allows error corrections not only in the final output but already in early stages, to minimize error propagation. In the long run, this constant manual correction produces large quantities of valuable, high-quality training material, which can be used to improve fully automatic approaches. In addition, extensive configuration capabilities are provided to set the degree of automation of the workflow and to adapt the carefully selected default parameters to specific printings, if necessary. During experiments, the fully automated application to 19th-century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY Finereader on moderate layouts if suitably pretrained mixed OCR models are available. Furthermore, on very complex early printed books, even users with minimal or no experience were able to capture the text with manageable effort and great quality, achieving excellent Character Error Rates (CERs) below 0.5%. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components via standardized interfaces like PageXML, thus aiming at continually higher automation for historical printings.
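The quality measure reported here, Character Error Rate (CER), is the edit distance between the recognized text and the ground-truth transcription, divided by the length of the ground truth. A minimal Python sketch of the metric (illustrative, not OCR4all's own implementation):

```python
# CER: Levenshtein distance between OCR output and ground truth,
# normalized by the ground-truth length. Illustrative sketch only.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ocr_text: str, ground_truth: str) -> float:
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)

print(cer("Historical Printinq", "Historical Printing"))  # one substitution in 19 chars
```

A CER below 0.5% thus corresponds to fewer than one erroneous character per 200 characters of ground truth.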


2020 ◽  
Vol 2020 (1) ◽  
pp. 78-81
Author(s):  
Simone Zini ◽  
Simone Bianco ◽  
Raimondo Schettini

Rain removal from pictures taken under bad weather conditions is a challenging task that aims to improve the overall quality and visibility of a scene. The enhanced images usually constitute the input for subsequent Computer Vision tasks such as detection and classification. In this paper, we present a Convolutional Neural Network, based on the Pix2Pix model, for removing rain streaks from images, with specific interest in evaluating the results of the processing with respect to the Optical Character Recognition (OCR) task. In particular, we present a way to generate a rainy version of the Street View Text Dataset (R-SVTD) for evaluating text detection and recognition in bad weather conditions. Experimental results on this dataset show that our model outperforms the state of the art in terms of two commonly used image quality metrics, and that it is capable of improving the performance of an OCR model in detecting and recognising text in the wild.
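The abstract does not name its two image-quality metrics; PSNR is one of the most common full-reference choices for restoration tasks like deraining. A minimal numpy sketch (not the authors' code; the toy images below are assumptions):

```python
# PSNR between a reference image and a restored one, both uint8.
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray) -> float:
    """Peak Signal-to-Noise Ratio in dB for 8-bit images."""
    mse = np.mean((reference.astype(np.float64)
                   - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(255.0 ** 2 / mse)

clean = np.full((64, 64), 128, dtype=np.uint8)  # toy reference image
noisy = clean.copy()
noisy[0, 0] = 138                               # a single corrupted pixel
print(psnr(clean, noisy))                       # high PSNR: near-identical
```

Higher PSNR means the derained image is closer to the rain-free reference.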


2014 ◽  
Vol 60 (3) ◽  
pp. 463-470 ◽  
Author(s):  
Charles D Hawker ◽  
William McCarthy ◽  
David Cleveland ◽  
Bonnie L Messinger

Abstract BACKGROUND Mislabeled samples are a serious problem in most clinical laboratories. Published error rates range from 0.39/1000 to as high as 1.12%. Standardization of bar codes and label formats has not yet achieved the needed improvement. The mislabel rate in our laboratory, although low compared with published rates, prompted us to seek a solution to achieve zero errors. METHODS To reduce or eliminate our mislabeled samples, we invented an automated device using 4 cameras to photograph the outside of a sample tube. The system uses optical character recognition (OCR) to look for discrepancies between the patient name in our laboratory information system (LIS) vs the patient name on the customer label. All discrepancies detected by the system's software then require human inspection. The system was installed on our automated track and validated with production samples. RESULTS We obtained 1 009 830 images during the validation period, and every image was reviewed. OCR passed approximately 75% of the samples, and no mislabeled samples were passed. The 25% failed by the system included 121 samples actually mislabeled by patient name and 148 samples with spelling discrepancies between the patient name on the customer label and the patient name in our LIS. Only 71 of the 121 mislabeled samples detected by OCR were found through our normal quality assurance process. CONCLUSIONS We have invented an automated camera system that uses OCR technology to identify potential mislabeled samples. We have validated this system using samples transported on our automated track. Full implementation of this technology offers the possibility of zero mislabeled samples in the preanalytic stage.
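The core check described, comparing the OCR-read name on the customer label against the patient name in the LIS and routing any discrepancy to human inspection, can be sketched as follows. The normalization rules here are illustrative assumptions, not the authors' actual matching logic:

```python
# Flag tubes whose OCR-read label name disagrees with the LIS name.
# Normalization rules are illustrative assumptions.

def normalize(name: str) -> str:
    """Case-fold, trim, and rewrite 'Last, First' as 'first last'."""
    name = name.strip().lower()
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return " ".join(name.split())

def needs_review(ocr_name: str, lis_name: str) -> bool:
    """True if label and LIS names disagree after normalization,
    i.e. the sample must be routed to human inspection."""
    return normalize(ocr_name) != normalize(lis_name)

print(needs_review("DOE, JANE", "Jane Doe"))    # same patient, passes
print(needs_review("DOE, JANE", "Jane Dough"))  # discrepancy, flag it
```

Note that, as in the paper, a spelling discrepancy still triggers review even when it is not a true mislabel; erring toward human inspection is what makes zero passed mislabels achievable.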


1979 ◽  
Vol 73 (10) ◽  
pp. 389-399
Author(s):  
Gregory L. Goodrich ◽  
Richard R. Bennett ◽  
William R. De L'aune ◽  
Harvey Lauer ◽  
Leonard Mowinski

This study was designed to assess the Kurzweil Reading Machine's ability to read three different type styles produced by five different means. The results indicate that the Kurzweil Reading Machines tested have different error rates depending upon the means of producing the copy and upon the type style used; there was a significant interaction between copy method and type style. The interaction indicates that some type styles are better read when the copy is made by one means rather than another. Error rates varied between less than one percent and more than twenty percent. In general, the user will find that high quality printed materials will be read with a relatively high level of accuracy, but as the quality of the material decreases, the number of errors made by the machine also increases. As this error rate increases, the user will find it increasingly difficult to understand the spoken output.


2019 ◽  
Vol 277 ◽  
pp. 02030 ◽  
Author(s):  
Yuncong Lu

Handwritten capital letter recognition is the task of identifying handwritten capital letters by machine or computer intelligence, and falls within the field of optical character recognition. Since capital letters are used widely around the world, their identification and analysis often form core components of control systems, which makes research on handwritten capital letter recognition of considerable practical significance. The key parts of the research in this paper are image preprocessing and the optimal selection of feature vectors, culminating in the design of a handwritten capital letter recognition system. The commonly used Fourier and Bayesian approaches are compared, and the Fourier transform feature is ultimately adopted for classification in the system. Tests on the experimental data show that, after repeated training, the handwritten capital letter recognition system established in this paper achieves high recognition accuracy.
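The Fourier-feature extraction this kind of system relies on can be sketched roughly as follows. The glyph size, the number of retained coefficients, and the use of FFT magnitudes (which are insensitive to circular shifts of the glyph) are all illustrative assumptions, not details from the paper:

```python
# Low-frequency 2D-FFT magnitudes as a feature vector for a character image.
import numpy as np

def fourier_features(glyph: np.ndarray, k: int = 4) -> np.ndarray:
    """Magnitudes of the k x k lowest-frequency FFT coefficients.

    Magnitudes discard phase, so the features are unchanged by
    circular translations of the glyph within the image.
    """
    spectrum = np.fft.fft2(glyph.astype(np.float64))
    return np.abs(spectrum[:k, :k]).ravel()

# Toy 8x8 "glyph": a single vertical stroke.
glyph = np.zeros((8, 8))
glyph[:, 3] = 1.0
print(fourier_features(glyph).shape)  # (16,) feature vector
```

A classifier (Bayesian or otherwise) would then be trained on these 16-dimensional vectors rather than on the raw 64 pixels.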


2020 ◽  
Vol 17 (9) ◽  
pp. 4267-4275
Author(s):  
Jagadish Kallimani ◽  
Chandrika Prasad ◽  
D. Keerthana ◽  
Manoj J. Shet ◽  
Prasada Hegde ◽  
...  

Optical character recognition is the process of converting images of text into machine-encoded text, electronically or mechanically. The text in an image can be handwritten, typed, or printed. Examples of image sources include a picture of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give a 100% accurate result. This project analyzes the error rates of a few open-source optical character recognition systems (Boxoft OCR, ABBYY, Tesseract, Free Online OCR, etc.) on a set of diverse documents and presents a comparative study of them. From this, we can determine which OCR system is best suited for a given document.


2012 ◽  
Vol 68 (5) ◽  
pp. 659-683 ◽  
Author(s):  
Tobias Blanke ◽  
Michael Bryant ◽  
Mark Hedges

2019 ◽  
Vol 22 (1) ◽  
pp. 109-192
Author(s):  
Emily Chesley ◽  
Jillian Marcantonio ◽  
Abigail Pearson

Abstract This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.


1998 ◽  
Vol 10 (8) ◽  
pp. 2175-2200 ◽  
Author(s):  
Holger Schwenk

We present a new classification architecture based on autoassociative neural networks that are used to learn discriminant models of each class. The proposed architecture has several interesting properties with respect to other model-based classifiers like nearest-neighbors or radial basis functions: it has a low computational complexity and uses a compact distributed representation of the models. The classifier is also well suited for the incorporation of a priori knowledge by means of a problem-specific distance measure. In particular, we will show that tangent distance (Simard, Le Cun, & Denker, 1993) can be used to achieve transformation invariance during learning and recognition. We demonstrate the application of this classifier to optical character recognition, where it has achieved state-of-the-art results on several reference databases. Relations to other models, in particular those based on principal component analysis, are also discussed.
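The core idea, one autoassociative model per class with classification by smallest reconstruction error, can be sketched with linear autoassociators, which for squared error are equivalent to class-wise PCA (a relation the abstract itself notes). The dimensions and toy data below are illustrative assumptions, not the paper's setup, and the tangent-distance refinement is omitted:

```python
# Per-class linear autoassociators (class-wise PCA); a test sample is
# assigned to the class whose model reconstructs it with least error.
import numpy as np

class PCAAutoassociator:
    def __init__(self, n_components: int):
        self.n_components = n_components

    def fit(self, X: np.ndarray) -> "PCAAutoassociator":
        self.mean_ = X.mean(axis=0)
        # Principal directions from the SVD of the centered class data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, x: np.ndarray) -> float:
        centered = x - self.mean_
        recon = self.components_.T @ (self.components_ @ centered)
        return float(np.sum((centered - recon) ** 2))

def classify(x: np.ndarray, models: dict) -> str:
    # Predicted class = the model that reconstructs x best.
    return min(models, key=lambda c: models[c].reconstruction_error(x))

rng = np.random.default_rng(0)
# Two toy classes living near different 1-D subspaces of R^3.
class_a = rng.normal(size=(50, 1)) * np.array([[1.0, 0.0, 0.0]])
class_b = rng.normal(size=(50, 1)) * np.array([[0.0, 1.0, 0.0]])
models = {"a": PCAAutoassociator(1).fit(class_a),
          "b": PCAAutoassociator(1).fit(class_b)}
print(classify(np.array([2.0, 0.1, 0.0]), models))
```

Unlike nearest-neighbor classification, only the per-class mean and a few principal directions must be stored, which is the low-complexity, compact-representation property the abstract highlights.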

