Optical character recognition errors and their effects on natural language processing

Author(s):  
Daniel Lopresti


Author(s):
Yaseen Khather Yaseen ◽  
Alaa Khudhair Abbas ◽  
Ahmed M. Sana

Today, images are part of everyday communication between people. However, images are also used to share information by hiding and embedding messages within them, and images received through social media or email can contain harmful content that users cannot see and are therefore unaware of. This paper presents a model for detecting spam in images. The model combines optical character recognition, natural language processing, and a machine learning algorithm. Optical character recognition extracts the text from images, and natural language processing uses linguistic capabilities to detect and classify the language and to distinguish between normal text and slang. Features for the selected images are then extracted using the bag-of-words model, and the machine learning algorithm is run to detect any spam they may contain. Finally, the model predicts whether or not the image contains harmful content. The results show that the proposed combination of machine learning, optical character recognition, and natural language processing provides higher detection accuracy than machine learning alone.
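A minimal sketch of the kind of pipeline the abstract describes is given below, assuming pytesseract for the OCR step and scikit-learn for the bag-of-words features and classifier; the tiny training corpus and the library choices are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: OCR -> bag-of-words -> classifier, as outlined in the abstract.
# pytesseract and scikit-learn are assumed stand-ins for the components used.
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled corpus of texts previously extracted from images (illustrative only).
train_texts = ["claim your free prize now", "meeting agenda for monday",
               "cheap pills no prescription", "photos from the family trip"]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = normal

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

def classify_image(path: str) -> int:
    """OCR the image, then classify the extracted text: 1 = spam, 0 = normal."""
    text = pytesseract.image_to_string(Image.open(path))
    return int(model.predict([text])[0])
```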


2020 ◽  
Vol 32 (2) ◽  
Author(s):  
Gideon Jozua Kotzé ◽  
Friedel Wolff

As more natural language processing (NLP) applications benefit from neural network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline includes several components that handle the material in sequence. Image processing after scanning the document has been shown to be an important factor in final quality. Here we compare two different approaches for visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.
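The character error rate used in this evaluation is conventionally the edit distance between the OCR output and the ground-truth transcription, normalised by the length of the ground truth. A small self-contained sketch (whether the 1.69 figure above is a raw ratio or a percentage is not stated, so the conversion is left to a comment):

```python
# Sketch of the character error rate (CER) metric used in the evaluation above:
# Levenshtein edit distance between the OCR output and the ground truth,
# divided by the length of the ground truth (multiply by 100 for a percentage).
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ocr_output: str, ground_truth: str) -> float:
    return edit_distance(ocr_output, ground_truth) / max(len(ground_truth), 1)

print(cer("Setswana documnet", "Setswana document"))  # small example: ~0.12
```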


2021 ◽  
pp. 56-61
Author(s):  
Rachna Tewani ◽
...

In today's world, everything is getting digitized, and data scanning tools and photography are in widespread use. When we have a lot of image data, it becomes important to accumulate the data in a form that is useful to the company or organization. Doing this manually is a tedious task and takes an ample amount of time. Hence, to simplify the job, we have developed a Flask API that takes an image folder as input and returns an Excel sheet of the relevant data from the images. We have used optical character recognition, via software such as pytesseract, to extract text from the images. Further in the process we have applied natural language processing, and finally we have located the relevant data using the glob and regex modules. This model is helpful for data collection from vehicle registration certificates, allowing data such as chassis number, owner name, and car number to be stored easily, and it can also be applied to Aadhaar cards and PAN cards.
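A hedged sketch of such an endpoint is shown below, assuming pytesseract for OCR, glob and regular expressions for field extraction, and pandas for the Excel output; the route name, the chassis-number pattern, and the file layout are illustrative placeholders, not the authors' exact code.

```python
# Illustrative Flask endpoint: OCR every image in a folder, pull a field out
# with a regular expression, and return the results as an Excel sheet.
# Requires pytesseract, pandas, openpyxl, and Flask to be installed.
import glob
import re
import pandas as pd
import pytesseract
from PIL import Image
from flask import Flask, request, send_file

app = Flask(__name__)

# Hypothetical pattern for one field; owner name, car number, etc. would be analogous.
CHASSIS_RE = re.compile(r"Chassis\s*No\.?\s*[:\-]?\s*(\w+)", re.IGNORECASE)

@app.route("/extract", methods=["POST"])
def extract():
    folder = request.json["folder"]  # path to the image folder
    rows = []
    for path in glob.glob(f"{folder}/*.jpg"):
        text = pytesseract.image_to_string(Image.open(path))
        match = CHASSIS_RE.search(text)
        rows.append({"file": path,
                     "chassis_number": match.group(1) if match else None})
    pd.DataFrame(rows).to_excel("output.xlsx", index=False)
    return send_file("output.xlsx")

if __name__ == "__main__":
    app.run()
```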


2021 ◽  
Author(s):  
Gian Maria Zaccaria ◽  
Vito Colella ◽  
Simona Colucci ◽  
Felice Clemente ◽  
Fabio Pavone ◽  
...  

BACKGROUND: The unstructured nature of medical data from real-world (RW) patients and the scarce accessibility of integrated systems for researchers restrain the use of RW information for clinical and translational research purposes. Natural language processing (NLP) might help in transposing unstructured reports into electronic health records (EHRs), thus promoting their standardization and sharing.
OBJECTIVE: We aimed to design a tool that captures pathological features directly from hemo-lymphopathology reports and automatically records them into electronic case report forms (eCRFs).
METHODS: We exploited optical character recognition and NLP techniques to develop a web application, named ARGO (Automatic Record Generator for Oncology), that recognizes unstructured information from diagnostic paper-based reports of diffuse large B-cell lymphomas (DLBCL), follicular lymphomas (FL), and mantle cell lymphomas (MCL). ARGO was programmed to match data against standard diagnostic criteria of the National Institutes of Health, automatically assign a diagnosis and, via an Application Programming Interface, populate specific eCRFs on the REDCap platform, according to the College of American Pathologists templates. A selection of 239 reports (n=106 DLBCL, n=79 FL, and n=54 MCL) from the Pathology Unit at the IRCCS - Istituto Tumori “Giovanni Paolo II” of Bari (Italy) was used to assess ARGO's performance in terms of accuracy, precision, recall and F1-score.
RESULTS: By applying our workflow, we successfully converted 233 paper-based reports into corresponding eCRFs incorporating structured information about diagnosis, tissue of origin and anatomical site of the sample, major molecular markers, and cell-of-origin subtype. Overall, ARGO showed high performance (nearly 90% accuracy, precision, recall and F1-score) in capturing the report identification number, biopsy date, specimen type, diagnosis, and additional molecular features.
CONCLUSIONS: We developed and validated an easy-to-use tool that converts RW paper-based diagnostic reports of major lymphoma subtypes into structured eCRFs. ARGO is cheap, feasible, and easily transferable into daily practice to generate REDCap-based EHRs for clinical and translational research purposes.
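As an illustration of the final step, the sketch below pushes one structured record into a REDCap project through the standard REDCap record-import API; the URL, token, and field names are placeholders, and the real eCRF schema follows the College of American Pathologists templates used by ARGO.

```python
# Hedged sketch: import one record into REDCap via its record-import API.
# The endpoint URL, API token, and field names below are hypothetical.
import json
import requests

def push_record(api_url: str, token: str, record: dict) -> str:
    """Send one flat record to a REDCap project and return the API response."""
    payload = {
        "token": token,
        "content": "record",
        "action": "import",
        "format": "json",
        "type": "flat",
        "data": json.dumps([record]),
    }
    response = requests.post(api_url, data=payload)
    response.raise_for_status()
    return response.text

# Example call with hypothetical field names.
push_record("https://redcap.example.org/api/", "MY_TOKEN",
            {"record_id": "233", "diagnosis": "DLBCL", "biopsy_date": "2020-05-14"})
```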


Author(s):  
Jeff Blackadar

Bibliothèque et Archives Nationales du Québec digitally scanned and converted to text a large collection of newspapers, creating a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably due to the many errors introduced by the optical character recognition (OCR) conversion process. This digital history project applied natural language processing in an R program to create a new and useful index of this corpus of digitized content despite the OCR-related errors. The project used editions of The Equity, published in Shawville, Quebec since 1883. The program extracted the names of all person, location and organization entities that appeared in each edition. Each entity was cataloged in a database and related to the edition of the newspaper in which it appeared. The database was published to a public website so that other researchers can use it. The resulting index, or finding aid, allows researchers to access The Equity in ways beyond full-text searching. People, locations and organizations appearing in The Equity are listed on the website, and each entity links to a page listing all of the issues in which that entity appeared, as well as the other entities that may be related to it. Rendering the text files of each scanned newspaper into entities and indexing them in a database allows the content of the newspaper to be explored by entity name and type rather than as a set of large text files. Website: http://www.jeffblackadar.ca/graham_fellowship/corpus_entities_equity/
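The project itself was written in R; purely as an illustration of the approach, the sketch below performs the same kind of entity extraction and indexing in Python with spaCy and SQLite (the library choices and table layout are assumptions, not the project's actual code).

```python
# Hedged sketch of the indexing idea: run named-entity recognition over the OCR'd
# text of one newspaper edition and catalogue each entity in a small database.
import sqlite3
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

conn = sqlite3.connect("equity_index.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entity
                (name TEXT, label TEXT, edition TEXT)""")

def index_edition(edition_id: str, ocr_text: str) -> None:
    """Store every person, location, and organization entity found in one edition."""
    doc = nlp(ocr_text)
    rows = [(ent.text, ent.label_, edition_id)
            for ent in doc.ents if ent.label_ in ("PERSON", "GPE", "LOC", "ORG")]
    conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", rows)
    conn.commit()
```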


Author(s):  
Binod Kumar Prasad

Purpose of the study: The purpose of this work is to present an offline Optical Character Recognition system that recognises handwritten English numerals to help automate document reading. It helps to avoid the tedious and time-consuming manual typing needed to key important information into a computer system to preserve it for a longer time. Methodology: This work applies curvature features of English numeral images by encoding them in terms of distance and slope. The finer local details of the images have been extracted using zonal features. The feature vectors obtained from the combination of these features have been fed to a KNN classifier. The whole work has been executed using the MATLAB Image Processing Toolbox. Main Findings: The system produces an average recognition rate of 96.67% with K=1, whereas with K=3 the rate increases to 97%, with corresponding error rates of 3.33% and 3% respectively. Out of all ten numerals, some, like ‘3’ and ‘8’, have shown comparatively lower recognition rates because of the similarity between their structures. Applications of this study: The proposed work is related to the recognition of English numerals. The model can be applied widely to the recognition of other patterns, such as signature verification, face recognition, or character and word recognition in other languages under Natural Language Processing. Novelty/Originality of this study: The novelty of the work lies in the process of feature extraction. Curves present in the structure of a numeral sample have been encoded based on distance and slope, yielding distance features and slope features. Vertical Delta Distance Coding (VDDC) and Horizontal Delta Distance Coding (HDDC) encode a curve from the vertical and horizontal directions to reveal concavity and convexity from different angles.
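As a rough illustration of the classification stage, the sketch below feeds simple zonal (zone-density) features to a KNN classifier; the paper's full feature vector also includes the distance- and slope-based curvature encodings (VDDC/HDDC), which are not reproduced here, and the original work was done in MATLAB rather than Python.

```python
# Hedged sketch: zonal features of a binarised numeral image fed to a KNN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def zonal_features(image: np.ndarray, zones: int = 4) -> np.ndarray:
    """Split a binary numeral image into zones x zones cells and return the
    fraction of foreground pixels in each cell as the feature vector."""
    h, w = image.shape
    feats = []
    for i in range(zones):
        for j in range(zones):
            cell = image[i * h // zones:(i + 1) * h // zones,
                         j * w // zones:(j + 1) * w // zones]
            feats.append(cell.mean())
    return np.array(feats)

# Hypothetical usage, assuming train_images / train_labels of binarised numerals:
# X = np.array([zonal_features(img) for img in train_images])
# knn = KNeighborsClassifier(n_neighbors=3).fit(X, train_labels)
```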


2013 ◽  
Vol 846-847 ◽  
pp. 1239-1242
Author(s):  
Yang Yang ◽  
Hui Zhang ◽  
Yong Qi Wang

This paper presents our recent work towards the development of a voice calculator based on speech error correction and natural language processing. The calculator enhances the accuracy of speech recognition by classifying and summarizing recognition errors in the domain of spoken numerical calculation, constructing a Pinyin-text mapping library and replacement rules, and combining a priority correction mechanism with a memory correction mechanism for the Pinyin-text mapping. For the correctly recognized expression, the calculator uses a recursive-descent parsing algorithm and a synthesized-attribute computing algorithm to calculate the final result, which is output through a TTS engine. The implementation of this voice calculator makes the calculator more humane and intelligent.
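The recursive-descent parsing step can be illustrated with a minimal arithmetic-expression parser; the grammar below (+, -, *, /, parentheses) is a simplified stand-in for the calculator's actual grammar, and the tokeniser is an assumption for illustration.

```python
# Hedged sketch of recursive-descent parsing and evaluation of a corrected expression.
import re

def tokenize(expr: str):
    """Split an expression into numbers, operators, and parentheses."""
    return re.findall(r"\d+\.?\d*|[()+\-*/]", expr)

def parse(tokens):
    pos = 0

    def expression():  # expression := term (('+' | '-') term)*
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]; pos += 1
            value = value + term() if op == "+" else value - term()
        return value

    def term():        # term := factor (('*' | '/') factor)*
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] in "*/":
            op = tokens[pos]; pos += 1
            value = value * factor() if op == "*" else value / factor()
        return value

    def factor():      # factor := NUMBER | '(' expression ')'
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            value = expression()
            pos += 1   # skip ')'
            return value
        value = float(tokens[pos]); pos += 1
        return value

    return expression()

print(parse(tokenize("3 + 4 * (2 - 1)")))  # 7.0
```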

