An Accuracy Examination of OCR Tools

This paper presents a comparative study of optical character recognition (OCR) using several open-source OCR tools. OCR extracts text from images, and its applications range from extracting text from documents and images to reading and processing text available in digital form. The accuracy of OCR can depend on the text segmentation and pre-processing algorithms used. Text is sometimes difficult to retrieve from an image because of variations in size, style and orientation, or because of a complex image background. The authors attempted to extract vehicle numbers from number plates using several OCR tools, including Tesseract, GOCR, Ocrad and TensorFlow. They try to identify the best-performing method for optical character recognition and provide a comparative analysis of the tools' accuracy.
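A comparison of this kind could be scored as plate-level exact-match accuracy per tool. The sketch below is only an illustration under stated assumptions: the metric, the normalization rule, the tool labels and the sample plates are all hypothetical and are not taken from the paper.

```python
import re


def normalize_plate(text: str) -> str:
    """Uppercase the OCR output and keep only the characters A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def plate_accuracy(predictions: list[str], truths: list[str]) -> float:
    """Fraction of plates whose normalized OCR output equals the ground truth."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(
        normalize_plate(pred) == normalize_plate(truth)
        for pred, truth in zip(predictions, truths)
    )
    return hits / len(truths)


# Hypothetical ground truth and hypothetical OCR outputs (not real results).
truths = ["KA01AB1234", "MH12DE1433"]
outputs = {
    "tool_a": ["KA 01 AB 1234", "MH12DE1433"],
    "tool_b": ["KA01A81234", "MH12 DE 1433"],
}

scores = {tool: plate_accuracy(preds, truths) for tool, preds in outputs.items()}
```

Normalizing before comparison means that spacing and punctuation differences are not counted as recognition errors, while a confusion such as `B` read as `8` still is.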

2020 ◽  
Vol 17 (9) ◽  
pp. 4267-4275
Author(s):  
Jagadish Kallimani ◽  
Chandrika Prasad ◽  
D. Keerthana ◽  
Manoj J. Shet ◽  
Prasada Hegde ◽  
...  

Optical character recognition is the electronic or mechanical conversion of images of text into machine-encoded text. The text in the image can be handwritten, typed or printed. Examples of image sources include a photograph of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give a 100% accurate result. This project analyzes the error rate of a few open source optical character recognition systems (Boxoft OCR, ABBYY, Tesseract, Free Online OCR etc.) on a set of diverse documents and makes a comparative study of the results. From this, we can determine which OCR system is best suited to a given document.
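The abstract does not say how the error rate is computed. One widely used definition is the character error rate (CER): the Levenshtein edit distance between the OCR output and a reference transcription, divided by the length of the reference. The following is a minimal sketch under that assumption; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Because insertions are counted, the CER can exceed 1.0 when the OCR output is much longer than the reference.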


2021 ◽  
Vol 11 (2) ◽  
pp. 83-86
Author(s):  
Alan Jiju ◽  
Shaun Tuscano ◽  
Chetana Badgujar

This research seeks a methodology for extracting data from everyday printed bills and invoices. The extracted data can later be used extensively, for example in machine learning or statistical analysis. The research focuses on extracting the final bill amount, itinerary, date and similar data from bills and invoices, as they contain a large amount of information about the user's purchases, likes, dislikes, etc. Optical Character Recognition (OCR) technology provides full alphanumeric recognition of printed or handwritten characters from images. First, OpenCV is used to detect the bill or invoice in the image and filter out unnecessary noise. The intermediate image is then passed to the Tesseract OCR engine for further processing. Tesseract applies text segmentation in order to extract written text in various fonts and languages. Our methodology proves to be highly accurate when tested on a variety of input images of bills and invoices.
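The abstract does not describe how fields are located once the OCR text is available. One simple possibility is pattern matching on the recognized text, sketched below. The regular expressions, the keyword list, the currency markers and the sample receipt are all assumptions made for illustration, not the authors' method.

```python
import re
from typing import Optional

# Assumed patterns: a "total"-style keyword followed by an optional currency
# marker and a number, and a numeric date such as 12/03/2021 or 12-03-21.
AMOUNT_RE = re.compile(
    r"\b(?:grand\s+total|total|amount\s+due)\b\s*[:\-]?\s*"
    r"(?:Rs\.?|INR|\$)?\s*(\d+(?:,\d{3})*(?:\.\d{2})?)",
    re.IGNORECASE,
)
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")


def extract_total(ocr_text: str) -> Optional[float]:
    """Return the last amount that follows a total-style keyword, if any."""
    matches = AMOUNT_RE.findall(ocr_text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def extract_date(ocr_text: str) -> Optional[str]:
    """Return the first numeric date as text; day/month order is left unresolved."""
    match = DATE_RE.search(ocr_text)
    return match.group(1) if match else None


# Hypothetical OCR output for a small receipt.
sample = (
    "SuperMart\n"
    "Date: 12/03/2021\n"
    "Milk 45.00\n"
    "Bread 30.50\n"
    "Subtotal: 75.50\n"
    "Total: Rs. 79.28\n"
)
```

The word boundary before the keyword keeps `Subtotal` from matching, and taking the last match favors the final amount printed near the bottom of a bill.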


2019 ◽  
Vol 22 (1) ◽  
pp. 109-192
Author(s):  
Emily Chesley ◽  
Jillian Marcantonio ◽  
Abigail Pearson

This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.


Author(s):  
Husni Al-Muhtaseb ◽  
Rami Qahwaji

Arabic text recognition is receiving more attention from both Arabic and non-Arabic-speaking researchers. This chapter provides a general overview of the state of the art in Arabic Optical Character Recognition (OCR) and the associated text recognition technology. It also investigates the characteristics of the Arabic language with respect to OCR and discusses related research on the different phases of text recognition, including pre-processing and text segmentation, common feature extraction techniques, classification methods and post-processing techniques. Moreover, the chapter discusses the available databases for Arabic OCR research and lists the available commercial software. Finally, it explores the challenges related to Arabic OCR and discusses possible future trends.


2018 ◽  
Vol 7 (2.24) ◽  
pp. 361 ◽  
Author(s):  
Nitin Ramesh ◽  
Aksha Srivastava ◽  
K Deeba

Document text recognition uses a concept called OCR (optical character recognition), which is the recognition of printed or written text characters by a computer. This involves scanning a document containing text and converting it, character by character, to digital form. Thus, it is defined as the process of digitizing a document image into its constituent characters. Equipment used to obtain clearer images for analysis includes cameras and flatbed scanners. Even though it has been around since 1870, OCR technology is yet to reach perfection. This demanding nature of Optical Character Recognition has led various researchers, industries and technology enthusiasts to devote their attention to this field. In recent times one can notice a significant increase in the number of research organizations investing their time and effort in this field. This research summarizes the progress, different aspects and various issues in this field. The aim is to present a thorough overview of various proposals, advancements and discussions aimed at resolving the problems that arise in traditional OCR.


Research is actively ongoing in the field of pattern recognition, and new ideas are developed and implemented in this field across the globe. Optical Character Recognition (OCR) is one of the inseparable applications of pattern recognition. Although extensive research has already been reported in this field, multilingual Optical Character Recognition remains its most challenging aspect and is still the need of the hour. Many researchers are searching for the best solutions for recognition. In this research paper, we propose the steps for recognizing Devanagari and English scripts occurring together in documents. A new approach to segmenting and splitting the characters of both scripts is also introduced for the benefit of researchers. In documents containing both English and Devanagari scripts, English characters are usually already separated; the challenge is to separate the Devanagari characters. The paper also implements an algorithm to segment Devanagari and Roman scripts simultaneously.
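The abstract does not give the details of the proposed segmentation approach. As background only, a common baseline for this problem is projection-profile segmentation: Devanagari characters in a word are joined by a header line (shirorekha), so the header row is blanked first and the word is then split at empty columns. The sketch below illustrates that baseline on a tiny hypothetical binary image; it is not the authors' algorithm, and the function names are illustrative.

```python
Grid = list[list[int]]  # binary image of one word: 1 = ink, 0 = background


def remove_header_line(word: Grid) -> Grid:
    """Blank every row whose ink count equals the maximum (assumed header line)."""
    if not word:
        return []
    row_ink = [sum(row) for row in word]
    peak = max(row_ink)
    return [
        [0] * len(row) if ink == peak else list(row)
        for row, ink in zip(word, row_ink)
    ]


def split_columns(word: Grid) -> list[tuple[int, int]]:
    """Return (start, end) column spans of ink runs; end is exclusive."""
    if not word:
        return []
    width = len(word[0])
    column_ink = [sum(row[c] for row in word) for c in range(width)]
    spans: list[tuple[int, int]] = []
    start = None
    for c, ink in enumerate(column_ink):
        if ink and start is None:
            start = c  # a character begins
        elif not ink and start is not None:
            spans.append((start, c))  # an empty column ends the character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Hypothetical 4x7 word image: row 0 is the header line joining two characters.
word = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
]

joined = split_columns(word)
separated = split_columns(remove_header_line(word))
```

With the header line in place the whole word forms a single span, which is why Roman characters, which have no such line, are typically already separable while Devanagari characters are not.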

