Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study

Optical character recognition is the conversion of images of text into machine-encoded text, electronically or mechanically. The text in the image can be handwritten, typed, or printed. Image sources include a photograph of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give a 100% accurate result. This project analyzes the error rate of a few open source optical character recognition systems (Boxoft OCR, ABBYY, Tesseract, Free Online OCR, etc.) on a set of diverse documents and presents a comparative study of the results. This lets us determine which OCR system is best suited to a given document.
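The abstract does not say how the error rate was measured. One common choice is the character error rate: the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that metric; the engine names and sample strings are hypothetical placeholders, not data from the study.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical reference text and OCR outputs (illustration only).
reference = "Optical character recognition"
outputs = {
    "engine_a": "Optical character recognition",
    "engine_b": "0ptical charactcr recognition",
}

rates = {name: character_error_rate(reference, text) for name, text in outputs.items()}
best = min(rates, key=rates.get)  # engine with the lowest error rate
```

Running the same comparison per document type, rather than over the whole set, is what would show which system suits which kind of document.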

Open source optical character recognition for historical research

Journal of Documentation ◽

10.1108/00220411211256021 ◽

2012 ◽

Vol 68 (5) ◽

pp. 659-683 ◽

Cited By ~ 3

Author(s):

Tobias Blanke ◽

Michael Bryant ◽

Mark Hedges

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Historical Research ◽

Optical Character

Towards Syriac Digital Corpora: Evaluation of Tesseract 4.0 for Syriac OCR

Hugoye: Journal of Syriac Studies ◽

10.31826/hug-2019-220105 ◽

2019 ◽

Vol 22 (1) ◽

pp. 109-192

Author(s):

Emily Chesley ◽

Jillian Marcantonio ◽

Abigail Pearson

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Further Training ◽

Current State ◽

Optical Character ◽

Extensive Test ◽

Degree Of Confidence ◽

High Degree

Abstract: This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.
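The abstract reports consonantal accuracy but does not describe how it was scored. As a rough illustration only, the sketch below assumes one possible definition: strip all combining marks (vowel points and diacritics) from both texts, then count the share of reference characters that the OCR output reproduces in order. The function names and sample strings are hypothetical, and the matching relies on the heuristic alignment in Python's `difflib`, not on the paper's method.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_marks(text: str) -> str:
    """Drop combining marks (Unicode categories Mn, Mc, Me), keeping base letters."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


def consonantal_accuracy(ref: str, hyp: str) -> float:
    """Share of unpointed reference characters matched, in order, by the OCR output."""
    ref_c, hyp_c = strip_marks(ref), strip_marks(hyp)
    if not ref_c:
        raise ValueError("reference has no base characters")
    matcher = SequenceMatcher(None, ref_c, hyp_c, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_c)


# Hypothetical sample: shin + pthaha (a vowel point), lamadh, mim, alaph.
reference = "\u072B\u0730\u0720\u0721\u0710"
ocr_unpointed = "\u072B\u0720\u0721\u0710"   # vowel point lost, consonants intact
ocr_misread = "\u072B\u0720\u0722\u0710"     # mim misread as nun

acc_unpointed = consonantal_accuracy(reference, ocr_unpointed)
acc_misread = consonantal_accuracy(reference, ocr_misread)
```

Under this definition a lost vowel point does not lower the score, while a misread consonant does, which is the distinction a consonant-only measure is meant to capture.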

Machine Replication of Human Perusing using Optical Character Recognition with Tesseract

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1079.1292s419 ◽

2019 ◽

Vol 9 (2S4) ◽

pp. 74-77

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Optical character recognition (OCR) is the machine replication of human reading: the electronic conversion of scanned images, whether typewritten or printed, into text. It is implemented using Tesseract, Google's open source OCR software. The system takes an image as input, extracts the text from that image, and then translates it into whatever other language the user wants. It can be useful in applications such as banking, the legal industry, travel and other industries, and home and office automation. It is mainly intended for people who are unable to read text documents, and to reduce the burden of data-entry work.[4]

Case Study IV: Optical Character Recognition

Neural Network PC Tools ◽

10.1016/b978-0-12-228640-7.50019-2 ◽

1990 ◽

pp. 285-294

Author(s):

Gary Entsminger

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Polarity Identification for Handwritten Text in Multilingual Documents Using Open Source Optical Character Recognition Tools

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9017 ◽

2020 ◽

Vol 17 (9) ◽

pp. 4045-4049

Author(s):

C. P. Chandrika ◽

Jagadish S. Kallimani

Keyword(s):

Web Services ◽

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Customer Reviews ◽

Optical Character ◽

Handwritten Text ◽

Amazon Web Services ◽

Rest Api

Sentiment analysis is a prerequisite for many applications. We propose a model in which handwritten text in English and Kannada is scanned with CamScanner and then converted into editable text using various open source optical character recognition (OCR) tools. The performance of the different OCR tools is analyzed and tabulated. Sentiment analysis is performed on the statements written in both English and Kannada using WordNet, the Algorithmia REST API, and local dictionaries, and we obtained satisfactory results. The same sentiment analysis module is also applied to customer reviews of a mobile product, with the reviews taken from Amazon Web Services. The customer's opinion of the product can be identified correctly.

Analysis of OCR Design and Implementation for GUI Modeling

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.128-129.1303 ◽

2011 ◽

Vol 128-129 ◽

pp. 1303-1307

Author(s):

Yu Mei Wu ◽

Zhi Fang Liu

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Work Load ◽

Test Case ◽

New Approach ◽

Design And Implementation ◽

Optical Character ◽

Automated Test Case Generation ◽

Model Based Testing

Many efforts have been made to achieve automated Graphical User Interface (GUI) testing. The most popular approach is model-based testing, which supports automated test case generation and execution. But building such a model is a non-trivial task, and it usually accounts for most of the workload in the entire testing process. Most approaches to automated model derivation depend on the programming language or a specific OS. In this paper, we propose a new approach to GUI modeling using Optical Character Recognition (OCR) and analyze in detail the technical problems encountered. A case study shows that our approach is capable of analyzing most GUI windows and generating the corresponding model, and hence eliminates the above constraint.

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Digital Scholarship in the Humanities ◽

10.1093/llc/fqz024 ◽

2019 ◽

Vol 34 (4) ◽

pp. 825-843 ◽

Cited By ~ 3

Author(s):

Mark J Hill ◽

Simon Hengchen

Keyword(s):

Eighteenth Century ◽

Text Analysis ◽

Character Recognition ◽

Optical Character Recognition ◽

Ground Truth ◽

Optical Character ◽

Historical Text ◽

The Impact

Abstract: This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.

An Accuracy Examination of OCR Tools

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1102.0789s419 ◽

2019 ◽

Vol 8 (9S4) ◽

pp. 5-9

Keyword(s):

Comparative Analysis ◽

Comparative Study ◽

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Research Paper ◽

Text Segmentation ◽

Digital Form ◽

Optical Character ◽

Processing Algorithms

In this research paper, the authors present a comparative study of optical character recognition (OCR) using different open source OCR tools. OCR is used to extract text from images. Its applications include extracting text from any document or image, or simply reading and processing text available in digital form. The accuracy of OCR can depend on text segmentation and pre-processing algorithms. Sometimes it is difficult to retrieve text from an image because of differences in size, style, and orientation, a complex image background, and so on. The authors tried to extract vehicle numbers from number plates using various OCR tools such as Tesseract, GOCR, Ocrad, and TensorFlow. They sought to identify the best available method for optical character recognition and provide a comparative analysis of the tools' accuracy.
