An Accuracy Examination of OCR Tools

This paper presents a comparative study of optical character recognition (OCR) using several open source OCR tools. OCR is used to extract text from images; its applications include extracting text from documents or images and reading and processing that text in digital form. OCR accuracy can depend on the text segmentation and pre-processing algorithms used. Retrieving text from an image is sometimes difficult because of variations in size, style, and orientation, or because of a complex image background. The authors extracted vehicle numbers from number plates using the OCR tools Tesseract, GOCR, Ocrad, and TensorFlow. They set out to identify the best method for optical character recognition and provide a comparative analysis of the tools' accuracy.

Performance Analysis of Open Source Optical Character Recognition

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9060 ◽

2020 ◽

Vol 17 (9) ◽

pp. 4267-4275

Author(s):

Jagadish Kallimani ◽

Chandrika Prasad ◽

D. Keerthana ◽

Manoj J. Shet ◽

Prasada Hegde ◽

...

Keyword(s):

Performance Analysis ◽

Comparative Study ◽

Open Source ◽

Error Rate ◽

Character Recognition ◽

Optical Character Recognition ◽

Accurate Result ◽

Recognition System ◽

Optical Character ◽

Recognition Systems

Optical character recognition is the electronic or mechanical conversion of images of text into machine-encoded text. The text in the image can be handwritten, typed, or printed. Image sources include a photograph of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give 100% accurate results. This project analyzes the error rate of several open source optical character recognition systems (Boxoft OCR, ABBYY, Tesseract, Free Online OCR, etc.) on a set of diverse documents and makes a comparative study of the results. From this, we can determine which OCR system is best suited to a given document.
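The abstract does not say how the error rate is measured. One common choice is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference transcription, divided by the reference length. The sketch below assumes that metric; the function names and the engine labels in the usage note are hypothetical and do not come from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from the output
                current[j - 1] + 1,      # extra character in the output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rank_engines(reference: str, outputs: dict) -> list:
    """Return (engine name, CER) pairs sorted from lowest to highest CER.

    `outputs` maps a hypothetical engine label to the text it produced.
    """
    scored = [(name, character_error_rate(reference, text))
              for name, text in outputs.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

Calling `rank_engines("OCR test", {"engine_a": "OCR test", "engine_b": "0CR test"})` would place `engine_a` first, since its output matches the reference exactly. CER can exceed 1.0 when an engine emits far more characters than the reference contains, so it is an error rate rather than a bounded percentage.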

How to Improve Optical Character Recognition of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine – Final Notes on Development and Evaluation

Human Language Technology. Challenges for Computer Science and Linguistics - Lecture Notes in Computer Science ◽

10.1007/978-3-030-66527-2_2 ◽

2020 ◽

pp. 17-30

Author(s):

Mika Koistinen ◽

Kimmo Kettunen ◽

Jukka Kervinen

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

A comparative study of different approaches of primitive printed Arabic Optical Character Recognition

2015 11th International Computer Engineering Conference (ICENCO) ◽

10.1109/icenco.2015.7416333 ◽

2015 ◽

Cited By ~ 1

Author(s):

Mohamed Dahi ◽

Noura A. Semary ◽

Mohiy M. Hadhoud

Keyword(s):

Comparative Study ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Open source optical character recognition for historical research

Journal of Documentation ◽

10.1108/00220411211256021 ◽

2012 ◽

Vol 68 (5) ◽

pp. 659-683 ◽

Cited By ~ 3

Author(s):

Tobias Blanke ◽

Michael Bryant ◽

Mark Hedges

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Historical Research ◽

Optical Character

OCR Text Extraction

International Journal of Engineering and Management Research ◽

10.31033/ijemr.11.2.11 ◽

2021 ◽

Vol 11 (2) ◽

pp. 83-86

Author(s):

Alan Jiju ◽

Shaun Tuscano ◽

Chetana Badgujar

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

Character Recognition ◽

Optical Character Recognition ◽

Text Segmentation ◽

Similar Data ◽

Written Text ◽

Amount Of Information ◽

Optical Character ◽

Intermediate Image

This research seeks a methodology for extracting data from everyday printed bills and invoices. The extracted data can later be used extensively, for example in machine learning or statistical analysis. The research focuses on extracting the final bill amount, itinerary, date, and similar data from bills and invoices, as they encapsulate an ample amount of information about users' purchases, likes, dislikes, and so on. Optical Character Recognition (OCR) technology provides full alphanumeric recognition of printed or handwritten characters from images. First, OpenCV is used to detect the bill or invoice in the image and filter out unnecessary noise. The intermediate image is then passed to the Tesseract OCR engine for further processing. Tesseract applies text segmentation in order to extract written text in various fonts and languages. Our methodology proves to be highly accurate when tested on a variety of input images of bills and invoices.
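The abstract does not describe how the bill amount and date are located once the text has been recognized. As an illustration of that final step only, the sketch below assumes the OCR engine's output is already available as a plain string and pulls out the fields with regular expressions. The patterns, field names, and sample text are hypothetical and are not the authors' method; real bills vary far more than these patterns allow.

```python
import re

# Hypothetical pattern: a "total"-style label, an optional separator,
# an optional short currency marker, then a number such as 1,234.50.
AMOUNT_RE = re.compile(
    r"\b(?:grand\s+total|total|amount\s+due)\b\s*[:\-]?\s*"
    r"[^\d\n]{0,4}(\d[\d,]*(?:\.\d{1,2})?)",
    re.IGNORECASE,
)

# Hypothetical pattern: numeric dates such as 12/03/2021 or 1-2-21.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")


def extract_bill_fields(ocr_text: str) -> dict:
    """Return the last labelled total and the first date found in the text.

    A field that cannot be found is reported as None.
    """
    amounts = AMOUNT_RE.findall(ocr_text)
    date_match = DATE_RE.search(ocr_text)
    return {
        # The final total usually appears after subtotals and taxes,
        # so the last labelled amount is taken.
        "total": float(amounts[-1].replace(",", "")) if amounts else None,
        "date": date_match.group(1) if date_match else None,
    }
```

The word boundary before `total` keeps a `Subtotal` line from being mistaken for the final amount. A date like `12/03/2021` is returned as a string because its day and month order cannot be determined from the text alone.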

Towards Syriac Digital Corpora: Evaluation of Tesseract 4.0 for Syriac OCR

Hugoye: Journal of Syriac Studies ◽

10.31826/hug-2019-220105 ◽

2019 ◽

Vol 22 (1) ◽

pp. 109-192

Author(s):

Emily Chesley ◽

Jillian Marcantonio ◽

Abigail Pearson

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Further Training ◽

Current State ◽

Optical Character ◽

Extensive Test ◽

Degree Of Confidence ◽

High Degree

This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.

Arabic Optical Character Recognition

Applied Signal and Image Processing ◽

10.4018/978-1-60960-477-6.ch019 ◽

2011 ◽

pp. 324-346 ◽

Cited By ~ 1

Author(s):

Husni Al-Muhtaseb ◽

Rami Qahwaji

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Arabic Language ◽

Text Recognition ◽

Text Segmentation ◽

Future Trends ◽

Optical Character ◽

Arabic Ocr ◽

Processing Techniques ◽

Arabic Speaking

Arabic text recognition is receiving more attention from both Arabic-speaking and non-Arabic-speaking researchers. This chapter provides a general overview of the state of the art in Arabic Optical Character Recognition (OCR) and the associated text recognition technology. It also investigates the characteristics of the Arabic language with respect to OCR and discusses related research on the different phases of text recognition, including pre-processing and text segmentation, common feature extraction techniques, classification methods, and post-processing techniques. Moreover, the chapter discusses the available databases for Arabic OCR research and lists the available commercial software. Finally, it explores the challenges related to Arabic OCR and discusses possible future trends.

Improving Optical Character Recognition Techniques

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.24.12085 ◽

2018 ◽

Vol 7 (2.24) ◽

pp. 361 ◽

Cited By ~ 1

Author(s):

Nitin Ramesh ◽

Aksha Srivastava ◽

K Deeba

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Document Image ◽

Text Recognition ◽

Digital Form ◽

Written Text ◽

Optical Character ◽

Research Organizations ◽

The World

Document text recognition relies on OCR (optical character recognition), which is the recognition of printed or written text characters by a computer. This involves scanning a document containing text and converting it, character by character, to digital form. OCR can thus be defined as the process of digitizing a document image into its constituent characters. Cameras and flatbed scanners are used to obtain clearer images for analysis. Even though it has been around since 1870, OCR technology has yet to reach perfection. The demanding nature of optical character recognition has led various researchers, industries, and technology enthusiasts to devote their attention to this field. In recent times, there has been a significant increase in the number of research organizations investing time and effort in it. This research summarizes the progress, different aspects, and various issues in the field. The aim is to present a thorough overview of the proposals, advancements, and discussions aimed at resolving the problems that arise in traditional OCR.

A Comparative Study of Optical Character Recognition in Health Information System

2019 International Conference in Engineering Applications (ICEA) ◽

10.1109/ceap.2019.8883448 ◽

2019 ◽

Author(s):

Mario R. M. Ribeiro ◽

Duarte Julio ◽

Vasco Abelha ◽

Antonio Abelha ◽

Jose Machado

Keyword(s):

Information System ◽

Comparative Study ◽

Health Information ◽

Character Recognition ◽

Optical Character Recognition ◽

Health Information System ◽

Optical Character

Demystification of Bilingual Optical Character Recognition System for Devanagari and English Scripts

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1856.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 5180-5185

Keyword(s):

Pattern Recognition ◽

Character Recognition ◽

Optical Character Recognition ◽

Recognition System ◽

Research Paper ◽

New Approach ◽

Optical Character ◽

New Ideas

Research in the field of pattern recognition is actively ongoing, and new ideas are being developed and implemented around the globe. Optical Character Recognition (OCR) is one of the core applications of pattern recognition. Although extensive research has already been reported in this field, multilingual OCR remains its most challenging aspect and a pressing need. Many researchers are working to find the best solutions for recognition. In this research paper, we propose steps for recognizing Devanagari and English scripts that occur together in the same document. We also introduce a new approach to segmenting and splitting the characters of both scripts. In documents containing both English and Devanagari scripts, the English characters are usually already separated; the challenge is to separate the Devanagari characters. The paper also implements an algorithm that segments the Devanagari and Roman scripts simultaneously.
