Can Morphological Analyzers Improve the Quality of Optical Character Recognition?

Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. is paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers in OCR even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large amount of real word OCR errors.

Download Full-text

Kurzweil Reading Machine: A Partial Evaluation of Its Optical Character Recognition Error Rate

Journal of Visual Impairment & Blindness ◽

10.1177/0145482x7907301002 ◽

1979 ◽

Vol 73 (10) ◽

pp. 389-399

Author(s):

Gregory L. Goodrich ◽

Richard R. Bennett ◽

William R. De L'aune ◽

Harvey Lauer ◽

Leonard Mowinski

Keyword(s):

Error Rate ◽

Character Recognition ◽

Optical Character Recognition ◽

Partial Evaluation ◽

Error Rates ◽

Recognition Error ◽

Optical Character ◽

Printed Materials ◽

High Level

This study was designed to assess the Kurzweil Reading Machine's ability to read three different type styles produced by five different means. The results indicate that the Kurzweil Reading Machines tested have different error rates depending upon the means of producing the copy and upon the type style used; there was a significant interaction between copy method and type style. The interaction indicates that some type styles are better read when the copy is made by one means rather than another. Error rates varied between less than one percent and more than twenty percent. In general, the user will find that high quality printed materials will be read with a relatively high level of accuracy, but as the quality of the material decreases, the number of errors made by the machine also increases. As this error rate increases, the user will find it increasingly difficult to understand the spoken output.

Download Full-text

A Finding Aid for The Equity

Inquiry@Queen's Undergraduate Research Conference Proceedings ◽

10.24908/iqurcp.11597 ◽

2018 ◽

Author(s):

Jeff Blackadar

Keyword(s):

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Large Collection ◽

Digital History ◽

R Language ◽

Potential Value ◽

Optical Character ◽

Text Searching ◽

Person Location

Bibliothèque et Archives Nationales du Québec digitally scanned and converted to text a large collection of newspapers to create a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably due to many errors caused by the optical character recognition (OCR) text conversion process. This digital history project applied natural language processing in an R language computer program to create a new and useful index of this corpus of digitized content despite OCR related errors. The project used editions of The Equity, published in Shawville, Quebec since 1883. The program extracted the names of all the person, location and organization entities that appeared in each edition. Each of the entities was cataloged in a database and related to the edition of the newspaper it appeared in. The database was published to a public website to allow other researchers to use it. The resulting index or finding aid allows researchers to access The Equity in a different way than just full text searching. People, locations and organizations appearing in the Equity are listed on the website and each entity links to a page that lists all of the issues that entity appeared in as well as the other entities that may be related to it. Rendering the text files of each scanned newspaper into entities and indexing them in a database allows the content of the newspaper to be interacted with by entity name and type rather than just a set of large text files. Website: http://www.jeffblackadar.ca/graham_fellowship/corpus_entities_equity/

Download Full-text

Optical character recognition errors and their effects on natural language processing

International Journal on Document Analysis and Recognition (IJDAR) ◽

10.1007/s10032-009-0094-8 ◽

2009 ◽

Vol 12 (3) ◽

pp. 141-151 ◽

Cited By ~ 26

Author(s):

Daniel Lopresti

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character ◽

Recognition Errors

Download Full-text

Optical character recognition errors and their effects on natural language processing

Proceedings of the second workshop on Analytics for noisy unstructured text data - AND '08 ◽

10.1145/1390749.1390753 ◽

2008 ◽

Cited By ~ 10

Author(s):

Daniel Lopresti

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character ◽

Recognition Errors

Download Full-text

Image Spam Detection Using Machine Learning and Natural Language Processing

Journal of Southwest Jiaotong University ◽

10.35741/issn.0258-2724.55.2.41 ◽

2020 ◽

Vol 55 (2) ◽

Author(s):

Yaseen Khather Yaseen ◽

Alaa Khudhair Abbas ◽

Ahmed M. Sana

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Optical Character ◽

Harmful Content

Today, images are a part of communication between people. However, images are being used to share information by hiding and embedding messages within it, and images that are received through social media or emails can contain harmful content that users are not able to see and therefore not aware of. This paper presents a model for detecting spam on images. The model is a combination of optical character recognition, natural language processing, and the machine learning algorithm. Optical character recognition extracts the text from images, and natural language processing uses linguistics capabilities to detect and classify the language, to distinguish between normal text and slang language. The features for selected images are then extracted using the bag-of-words model, and the machine learning algorithm is run to detect any kind of spam that may be on it. Finally, the model can predict whether or not the image contains any harmful content. The results show that the proposed method using a combination of the machine learning algorithm, optical character recognition, and natural language processing provides high detection accuracy compared to using machine learning alone.

Download Full-text

APPLICATION OF ZONAL AND CURVATURE FEATURES TO NUMERALS RECOGNITION

International Journal of Students Research in Technology & Management ◽

10.18510/ijsrtm.2021.922 ◽

2021 ◽

Vol 9 (2) ◽

pp. 7-12

Author(s):

Binod Kumar Prasad

Keyword(s):

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Recognition Rate ◽

Recognition System ◽

Signature Verification ◽

Optical Character ◽

Knn Classifier ◽

Average Recognition Rate ◽

Distance Coding

Purpose of the study: The purpose of this work is to present an offline Optical Character Recognition system to recognise handwritten English numerals to help automation of document reading. It helps to avoid tedious and time-consuming manual typing to key in important information in a computer system to preserve it for a longer time. Methodology: This work applies Curvature Features of English numeral images by encoding them in terms of distance and slope. The finer local details of images have been extracted by using Zonal features. The feature vectors obtained from the combination of these features have been fed to the KNN classifier. The whole work has been executed using the MatLab Image Processing toolbox. Main Findings: The system produces an average recognition rate of 96.67% with K=1 whereas, with K=3, the rate increased to 97% with corresponding errors of 3.33% and 3% respectively. Out of all the ten numerals, some numerals like ‘3’ and ‘8’ have shown respectively lower recognition rates. It is because of the similarity between their structures. Applications of this study: The proposed work is related to the recognition of English numerals. The model can be used widely for recognition of any pattern like signature verification, face recognition, character or word recognition in another language under Natural Language Processing, etc. Novelty/Originality of this study: The novelty of the work lies in the process of feature extraction. Curves present in the structure of a numeral sample have been encoded based on distance and slope thereby presenting Distance features and Slope features. Vertical Delta Distance Coding (VDDC) and Horizontal Delta Distance Coding (HDDC) encode a curve from vertical and horizontal directions to reveal concavity and convexity from different angles.

Download Full-text

Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports

Gastrointestinal Endoscopy ◽

10.1016/j.gie.2020.08.038 ◽

2020 ◽

Author(s):

Sobia Nasir Laique ◽

Umar Hayat ◽

Shashank Sarvepalli ◽

Byron Vaughn ◽

Mounir Ibrahim ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Large Scale ◽

Data Extraction ◽

Quality Metric ◽

Optical Character

Download Full-text

IMPROVEMENT OF THE COLOR TEXT IMAGE BINARIZATION METHOD USING THE MINIMUM-DISTANCE CLASSIFIER

Applied Aspects of Information Technology ◽

10.15276/aait.01.2021.5 ◽

2021 ◽

Vol 4 (1) ◽

pp. 57-70

Author(s):

Marina V. Polyakova ◽

Alexandr G. Nesteryuk

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Minimum Distance ◽

Color Images ◽

Connected Components ◽

Image Binarization ◽

Optical Character ◽

Binarization Method ◽

Text Images

Optical character recognition systems for the images are used to convert books and documents into electronic form, to automate accounting systems in business, when recognizing markers using augmented reality technologies and etс. The quality of optical character recognition, provided that binarization is applied, is largely determined by the quality of separation of the foreground pixels from the background. Methods of text image binarization are analyzed and insufficient quality of binarization is noted. As a way of research the minimum-distance classifier for the improvement of the existing method of binarization of color text images is used. To improve the quality of the binarization of color text images, it is advisable to divide image pixels into two classes, “Foreground” and “Background”, to use classification methods instead of heuristic threshold selection, namely, a minimum-distance classifier. To reduce the amount of processed information before applying the classifier, it is advisable to select blocks of pixels for subsequent processing. This was done by analyzing the connected components on the original image. An improved method of the color text image binarization with the use of analysis of connected components and minimum-distance classifier has been elaborated. The research of the elaborated method showed that it is better than existing binarization methods in terms of robustness of binarization, but worse in terms of the error of the determining the boundaries of objects. Among the recognition errors, the pixels of images from the class labeled “Foreground” were more often mistaken for the class labeled “Background”. The proposed method of binarization with the uniqueness of class prototypes is recommended to be used in problems of the processing of color images of the printed text, for which the error in determining the boundaries of characters as a result of binarization is compensated by the thickness of the letters. With a multiplicity of class prototypes, the proposed binarization method is recommended to be used in problems of processing color images of handwritten text, if high performance is not required. The improved binarization method has shown its efficiency in cases of slow changes in the color and illumination of the text and background, however, abrupt changes in color and illumination, as well as a textured background, do not allowing the binarization quality required for practical problems.

Download Full-text

Effects of Ambient Light, Camcorder Settings, and Automated License Plate Reader Settings on Plate Transcription Rates

Transportation Research Record Journal of the Transportation Research Board ◽

10.3141/1804-08 ◽

2002 ◽

Vol 1804 (1) ◽

pp. 56-61 ◽

Cited By ~ 1

Author(s):

Michael Plotnikov ◽

Paul W. Shuldiner

Keyword(s):

Character Recognition ◽

Template Matching ◽

Optical Character Recognition ◽

License Plate ◽

Ambient Light ◽

Light Conditions ◽

Video Images ◽

Optical Character ◽

Plate Reader

The ability of an automated license plate reading (ALPR) system to convert video images of license plates into computer records depends on many factors. Of these, two are readily controlled by the operator: the quality of the video images captured in the field and the internal settings of the ALPR used to transcribe these images. A third factor, the light conditions under which the license plate images are acquired, is less easily managed, especially when camcorders are used in the field under ambient light conditions. A set of experiments was conducted to test the effects of ambient light conditions, video camcorder adjustments, and internal ALPR settings on the percent of correct reads attained by a specific type of ALPR, one whose optical character recognition process is based on template matching. Images of rear license plates were collected under four ambient light conditions: overcast with no shadows, and full sunlight with the sun in front of the camcorder, behind the camcorder, and orthogonal to the line of sight. Three camcorder exposure settings were tested. Two of the settings made use of the camcorder’s internal light meter, and the third relied solely on operator judgment. The license plates read ranged from 41% to 72%, depending most strongly on ambient light conditions. In all cases, careful adjustment of the ALPR led to significantly improved read rates over those obtained by using the manufacturer’s recommended default settings. Exposure settings based on the operator’s judgment worked best in all instances.

Download Full-text