THE COMPARISONS OF OCR TOOLS: A CONVERSION CASE IN THE MALAYSIAN HANSARD CORPUS DEVELOPMENT

2019 ◽  
Vol 4 (2) ◽  
pp. 335
Author(s):  
Anis Nadiah Che Abdul Rahman ◽  
Imran Ho Abdullah ◽  
Intan Safinaz Zainuddin ◽  
Azhar Jaludin

Optical Character Recognition (OCR) is a computational technology that recognizes printed characters by means of photoelectric devices and computer software, converting previously scanned images or texts into machine-readable, editable text. Numerous OCR tools are on the market for commercial and research use, some available for free and others requiring purchase. The accuracy an OCR tool achieves also depends on its pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for the development of the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports totalling 62 pages to measure the conversion accuracy and error rate of each tool. All of the tools were used to convert Adobe Portable Document Format (PDF) files into plain text (txt) files. The objective is to give an overview, based on accuracy and error rate, of how each OCR tool essentially works and how it can assist corpus building. The study indicates that the tools vary in accuracy and error rate when converting whole documents from PDF into plain text. It proposes that this step of corpus building becomes easier and more manageable when a researcher understands how an OCR tool works and can therefore choose the best tool before the outset of corpus development.
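
As an illustration of how conversion accuracy and error rate can be quantified, the following minimal Python sketch computes a character error rate (CER) from an edit distance; it assumes a hand-checked ground-truth transcription is available, and the tool names in the usage comment are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character edits turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(reference: str, ocr_output: str) -> float:
    """CER = character edit distance normalised by the reference length."""
    return levenshtein(reference, ocr_output) / max(len(reference), 1)

# Hypothetical comparison across converted reports:
# for tool, text in {"tool_a": txt_a, "tool_b": txt_b}.items():
#     cer = char_error_rate(ground_truth, text)
#     print(tool, f"CER = {cer:.2%}, accuracy = {1 - cer:.2%}")
```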

1979 ◽  
Vol 73 (10) ◽  
pp. 389-399
Author(s):  
Gregory L. Goodrich ◽  
Richard R. Bennett ◽  
William R. De L'aune ◽  
Harvey Lauer ◽  
Leonard Mowinski

This study was designed to assess the Kurzweil Reading Machine's ability to read three different type styles produced by five different means. The results indicate that the Kurzweil Reading Machines tested have different error rates depending upon the means of producing the copy and upon the type style used; there was a significant interaction between copy method and type style. The interaction indicates that some type styles are better read when the copy is made by one means rather than another. Error rates ranged from less than one percent to more than twenty percent. In general, the user will find that high-quality printed materials will be read with a relatively high level of accuracy, but as the quality of the material decreases, the number of errors made by the machine also increases. As this error rate increases, the user will find it increasingly difficult to understand the spoken output.
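
The reported interaction between copy method and type style can be made concrete with a small tabulation. The sketch below uses invented tallies (the 1979 figures are not reproduced in the abstract); an interaction appears when the penalty for a worse copy method differs across type styles.

```python
# Hypothetical (type style, copy method) -> (characters read, errors) tallies.
tallies = {
    ("Courier", "original"):  (10_000, 80),
    ("Courier", "photocopy"): (10_000, 450),
    ("Elite",   "original"):  (10_000, 120),
    ("Elite",   "photocopy"): (10_000, 2_300),
}

for (style, method), (chars, errors) in sorted(tallies.items()):
    print(f"{style:8s} {method:10s} error rate = {errors / chars:.1%}")

# An interaction shows up when the photocopy penalty differs by type style:
# here Elite degrades far more on photocopies than Courier does.
```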


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition, rather than being typed once more. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists covering most nineteenth-century farms in Norway constitute one example among a series of valuable sources which can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in the recognition of material with a complex table structure, outlining a new algorithmic model based on ‘linked hierarchies’. Within the scope of this model, a variety of tables and layouts can be described and recognized. The ‘linked hierarchies’ model has been implemented in the ‘CRIPT’ OCR software system, which successfully reads tables with a complex structure from several different historical sources.
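
The paper's actual model is not reproduced here, but a minimal sketch can suggest what a ‘linked hierarchies’ description of a table layout might look like: each hierarchy is a tree over page regions, and links tie nodes across hierarchies. All class and field names below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A layout element (page, column, row, record) in one hierarchy."""
    label: str
    bbox: tuple                                 # (x0, y0, x1, y1) on the page
    children: list = field(default_factory=list)
    links: list = field(default_factory=list)   # cross-hierarchy links

# Two hierarchies over the same tax-list page: a geometric one
# (the printed columns) and a logical one (one farm per record).
columns = Node("page", (0, 0, 2000, 3000), children=[
    Node("col:farm-name",  (0,   100, 600,  3000)),
    Node("col:assessment", (600, 100, 1200, 3000)),
])
record = Node("record:farm-17", (0, 840, 1200, 880))

# Linking a logical record to the geometric columns it crosses lets the
# recogniser interpret each text snippet in both contexts at once.
for col in columns.children:
    record.links.append(col)
```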


2014 ◽  
Vol 60 (3) ◽  
pp. 463-470
Author(s):  
Charles D Hawker ◽  
William McCarthy ◽  
David Cleveland ◽  
Bonnie L Messinger

BACKGROUND: Mislabeled samples are a serious problem in most clinical laboratories. Published error rates range from 0.39/1000 to as high as 1.12%. Standardization of bar codes and label formats has not yet achieved the needed improvement. The mislabel rate in our laboratory, although low compared with published rates, prompted us to seek a solution to achieve zero errors.

METHODS: To reduce or eliminate our mislabeled samples, we invented an automated device using 4 cameras to photograph the outside of a sample tube. The system uses optical character recognition (OCR) to look for discrepancies between the patient name in our laboratory information system (LIS) vs the patient name on the customer label. All discrepancies detected by the system's software then require human inspection. The system was installed on our automated track and validated with production samples.

RESULTS: We obtained 1 009 830 images during the validation period, and every image was reviewed. OCR passed approximately 75% of the samples, and no mislabeled samples were passed. The 25% failed by the system included 121 samples actually mislabeled by patient name and 148 samples with spelling discrepancies between the patient name on the customer label and the patient name in our LIS. Only 71 of the 121 mislabeled samples detected by OCR were found through our normal quality assurance process.

CONCLUSIONS: We have invented an automated camera system that uses OCR technology to identify potential mislabeled samples. We have validated this system using samples transported on our automated track. Full implementation of this technology offers the possibility of zero mislabeled samples in the preanalytic stage.
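
The paper does not publish its matching logic; the sketch below is one plausible reading of the screening step, assuming OCR has already produced the label name. Exact matches pass, and anything else, whether a near-miss spelling discrepancy or an outright mislabel, is routed to human inspection.

```python
import difflib

def normalise(name: str) -> str:
    """Case-fold and strip punctuation/extra whitespace before comparing."""
    kept = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def screen_sample(lis_name: str, ocr_label_name: str) -> str:
    """Pass only exact (normalised) matches; all else needs human review."""
    a, b = normalise(lis_name), normalise(ocr_label_name)
    if a == b:
        return "pass"
    similarity = difflib.SequenceMatcher(None, a, b).ratio()
    # A high ratio hints at a spelling discrepancy ('Jon' vs 'John');
    # a low ratio suggests an actual mislabel.
    return f"review (similarity = {similarity:.2f})"

print(screen_sample("SMITH, JOHN", "smith  john"))  # -> pass
print(screen_sample("SMITH, JOHN", "Smith, Jon"))   # -> review (...)
```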


Author(s):  
Christian Reul ◽  
Dennis Christ ◽  
Alexander Hartelt ◽  
Nico Balbach ◽  
Maximilian Wehner ◽  
...  

Optical Character Recognition (OCR) on historical printings is a challenging task, mainly due to the complexity of the layouts and the highly variant typography. Nevertheless, great progress has been made in historical OCR in recent years, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition, and post-processing. A frequent drawback of these tools is their limited usability for non-technical users such as humanities scholars, especially when several tools must be combined into a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output but already at early stages, minimizing error propagation. Furthermore, extensive configuration capabilities are provided to set the degree of automation of the workflow and to adapt the carefully selected default parameters to specific printings where necessary. Experiments showed that users with minimal or no experience were able to capture the text of even the earliest printed books with manageable effort and high quality, achieving excellent character error rates (CERs) below 0.5%. The fully automated application to 19th-century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY FineReader on moderate layouts if suitably pretrained mixed OCR models are available. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components through standardized interfaces like PageXML, thus aiming at continually higher automation for historical printings.
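
OCR4all exchanges results between components via PageXML. As a hedged illustration of consuming such output, the following sketch reads recognized text lines from a PAGE XML file with Python's standard library; the namespace URL is version-dependent and should match what the file actually declares.

```python
import xml.etree.ElementTree as ET

# PAGE content namespace; adjust to the version your files declare.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

def read_page_lines(path: str):
    """Yield (line id, recognised text) pairs from a PAGE XML file."""
    root = ET.parse(path).getroot()
    for line in root.iterfind(".//pc:TextLine", NS):
        unicode_el = line.find("./pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            yield line.get("id"), unicode_el.text

# for line_id, text in read_page_lines("page_0001.xml"):
#     print(line_id, text)
```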


2021 ◽  
Vol 48 (2) ◽  
Author(s):  
Pooja Jain ◽  
Dr. Kavita Taneja ◽  
Dr. Harmunish Taneja ◽  
...

Optical Character Recognition (OCR) is a very active research area in many challenging fields such as pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or handwritten documents, and images (photographs, advertisements, and the like) for further processing, and has been utilized in many real-world applications, including banking, education, insurance, finance, healthcare, and keyword-based search in documents. Many OCR toolsets are available under various categories, including open-source, proprietary, and online services. This research paper provides a comparative study of various OCR toolsets considering a variety of parameters.
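
As a concrete example of the open-source category, the sketch below drives the Tesseract engine through the pytesseract wrapper to pull editable text from an image; it assumes Pillow, pytesseract, and the Tesseract binary are installed, and the file names are placeholders.

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract; needs the tesseract binary

def image_to_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image and return its text content."""
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)

# text = image_to_text("scanned_page.png")
# with open("scanned_page.txt", "w", encoding="utf-8") as out:
#     out.write(text)
```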


Vehicles play a vital role in day-to-day transport, yet traffic violations, vehicle theft, unauthorized entry into highly restricted areas, and accidents are increasing, slowly raising the crime rate. Every vehicle has its own identity, and recognizing it plays a major role worldwide. For recognizing vehicles in safety and security systems, license plate detection and recognition (LPDR) plays a major role, identifying the vehicle registration number accurately at a certain distance. License plate recognition is an efficient and cost-effective technique for detection and recognition purposes, and automatic license plate recognition (ALPR) is used to locate the license plate on the vehicle. These methods and techniques vary with conditions such as image quality, vehicle position, lighting effects, and image type. The objective is to design an efficient automatic system that identifies authorized and unauthorized vehicles in residential societies by means of the vehicle number plate. Given a car image from the surveillance camera at the entrance, the system recognizes the number plate and extracts the characters using optical character recognition (OCR), which converts the characters in the image to plain text. The plate text is then cross-verified against a database to determine whether the vehicle belongs to a resident or a visitor, and an alert message is sent to the security official when a new visitor requests entry to the residential area. Log entries are stored separately for residents and visitors in the database. The system also provides details about parking availability in the residential area: by counting vehicles entering and leaving, it displays the available parking slots, and it is robust to size and lighting effects with a high detection rate.
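
The verification stage described above can be sketched in a few lines. The code below is a simplified illustration, not the authors' implementation: the plate text is assumed to come from an OCR step, the resident database is a hard-coded set, and send_alert stands in for a real notification channel.

```python
from datetime import datetime

RESIDENT_PLATES = {"KA01AB1234", "TN07CD5678"}  # hypothetical resident database
PARKING_CAPACITY = 50
vehicles_inside = 0
log = []

def send_alert(plate: str) -> None:
    """Placeholder for a real SMS/app notification to the security official."""
    print(f"ALERT: unregistered vehicle {plate} at the gate")

def handle_entry(ocr_plate_text: str) -> None:
    """Cross-check an OCR'd plate, log it, and update parking availability."""
    global vehicles_inside
    plate = ocr_plate_text.replace(" ", "").upper()
    role = "resident" if plate in RESIDENT_PLATES else "visitor"
    if role == "visitor":
        send_alert(plate)
    vehicles_inside += 1
    log.append((datetime.now(), plate, role, "in"))
    print(f"{plate}: {role}, free slots = {PARKING_CAPACITY - vehicles_inside}")

handle_entry("ka 01 ab 1234")   # resident
handle_entry("MH12XY9999")      # visitor -> alert
```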


2020 ◽  
Vol 17 (9) ◽  
pp. 4267-4275
Author(s):  
Jagadish Kallimani ◽  
Chandrika Prasad ◽  
D. Keerthana ◽  
Manoj J. Shet ◽  
Prasada Hegde ◽  
...  

Optical character recognition is the process of converting images of text into machine-encoded text, electronically or mechanically. The text in an image can be handwritten, typed, or printed; example sources include a picture of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give 100% accurate results. This project analyzes the error rates of a few open-source optical character recognition systems (Boxoft OCR, ABBYY, Tesseract, Free Online OCR, etc.) on a set of diverse documents and makes a comparative study of them, showing which OCR system is best suited for a given document.
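
A word-level counterpart to the character error rate sketched earlier is often used for such comparisons. The following self-contained sketch ranks hypothetical engine outputs by word error rate (WER) against a ground-truth line; the engine names and strings are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Invented outputs for two engines on the same ground-truth line:
truth = "OCR is the process of conversion"
outputs = {"engine_a": "OCR is the process of conversion",
           "engine_b": "0CR is the process at conversion"}
for engine, text in sorted(outputs.items(),
                           key=lambda kv: word_error_rate(truth, kv[1])):
    print(f"{engine}: WER = {word_error_rate(truth, text):.1%}")
```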


2021 ◽  
Vol 9 (1) ◽  
pp. 19-34
Author(s):  
Mikko Tolonen ◽  
Eetu Mäkelä ◽  
Ali Ijaz ◽  
Leo Lahti

Eighteenth Century Collections Online (ECCO) is the most comprehensive dataset available in machine-readable form for eighteenth-century printed texts. It plays a crucial role in studies of eighteenth-century language and has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a series of problems. The aim of this paper is to offer a general overview of ECCO for corpus linguistics by analysing, for example, its publication countries and languages. We also analyse the role of the substantial number of reprints and new editions in the data, discuss genres, and assess the estimates of Optical Character Recognition (OCR) quality. Our conclusion is that whereas ECCO provides a valuable source for corpus linguistics, scholars need to pay attention to historical source criticism. We highlight key aspects that need to be taken into account when assessing its possible uses.
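
The kind of balance check described above amounts to tallying metadata fields. The sketch below illustrates this with invented records; real ECCO metadata carries different and richer fields, so the keys here are assumptions.

```python
from collections import Counter

# Invented records; real ECCO metadata uses different, richer fields.
records = [
    {"country": "England",  "language": "English", "edition_of": None},
    {"country": "Scotland", "language": "English", "edition_of": "T012345"},
    {"country": "England",  "language": "French",  "edition_of": None},
]

print("by country: ", Counter(r["country"] for r in records))
print("by language:", Counter(r["language"] for r in records))
reprints = sum(r["edition_of"] is not None for r in records)
print(f"reprints/new editions: {reprints} of {len(records)}")
```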


Author(s):  
HARRIS DRUCKER ◽  
ROBERT SCHAPIRE ◽  
PATRICE SIMARD

A boosting algorithm, based on the probably approximately correct (PAC) learning model, is used to construct an ensemble of neural networks that significantly improves performance (compared to a single network) on optical character recognition (OCR) problems. The effect of boosting is reported for four handwritten image databases: 12,000 digits from segmented ZIP Codes from the United States Postal Service, and, from the National Institute of Standards and Technology, 220,000 digits, 45,000 upper-case letters, and 45,000 lower-case letters. We use two performance measures: the raw error rate (no rejects) and the reject rate required to achieve a 1% error rate on the patterns not rejected. Boosting improved performance significantly, and, in some cases, dramatically.
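
The second measure can be computed by sweeping a rejection threshold over the ensemble's confidence scores. The sketch below assumes per-pattern confidences and correctness flags are available; it rejects the least confident patterns first until the error on the accepted remainder meets the target.

```python
def reject_rate_at_error(confidences, correct, target_error=0.01):
    """Smallest fraction rejected so the accepted patterns' error rate
    is at most target_error; least confident patterns are rejected first."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    for rejected in range(len(order)):
        kept = order[rejected:]
        errors = sum(1 for i in kept if not correct[i])
        if errors / len(kept) <= target_error:
            return rejected / len(order)
    return 1.0

# Toy example: 5 patterns, one low-confidence error.
print(reject_rate_at_error([0.4, 0.9, 0.95, 0.99, 0.97],
                           [False, True, True, True, True]))  # -> 0.2
```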

