An Open Source Tesseract Based Optical Character Recognizer for Bengali Language

Mapping Intimacies ◽

10.31224/osf.io/je3m8 ◽

2021 ◽

Author(s):

Redwan Islam

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character ◽

Word Detection ◽

Two Phases ◽

Bengali Language

Optical Character Recognition (OCR) is the process of extracting text from an image. The main purpose of OCR is to produce editable documents from existing paper documents or image files. OCR primarily works in two phases: character detection and word detection. More sophisticated approaches also perform sentence detection to preserve a document's structure. In this paper, we discuss the process of developing an OCR system for the Bengali language. Considerable effort has gone into developing OCR for Bengali. Although some OCR systems have been developed, none of them is completely error-free. For our thesis, we trained the Tesseract OCR engine to develop an OCR system for Bengali. Tesseract is currently the most accurate OCR engine. It was developed at HP Labs and is currently sponsored by Google. Tesseract offers two training options, Legacy training and LSTM training, and we used both.
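
The two engine types mentioned in the abstract can also be selected at recognition time through the Tesseract command line's `--oem` option (0 for the legacy engine, 1 for the LSTM engine). The following is a minimal sketch, not the authors' pipeline: the function names, file names, and the default language code `ben` are assumptions chosen for illustration, and the legacy mode only works if the installed traineddata file contains a legacy model.

```python
import shutil
import subprocess

# Tesseract's --oem values: 0 selects the legacy engine, 1 the LSTM engine.
OEM_LEGACY = 0
OEM_LSTM = 1


def build_tesseract_command(image_path, output_base, oem, lang="ben"):
    """Return the argv list for a Tesseract CLI call; nothing is executed here."""
    if oem not in (OEM_LEGACY, OEM_LSTM):
        raise ValueError("oem must be 0 (legacy) or 1 (LSTM)")
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]


def run_ocr(image_path, output_base, oem):
    """Run Tesseract if it is installed; return True when the command succeeds."""
    if shutil.which("tesseract") is None:
        return False  # Tesseract is not on PATH, so there is nothing to run.
    command = build_tesseract_command(image_path, output_base, oem)
    return subprocess.run(command, check=False).returncode == 0
```

Calling `build_tesseract_command("page.png", "page_lstm", OEM_LSTM)` yields the arguments for an LSTM run on a hypothetical `page.png`; running the same image with `OEM_LEGACY` gives a second output file to compare against.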

How to Improve Optical Character Recognition of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine – Final Notes on Development and Evaluation

Human Language Technology. Challenges for Computer Science and Linguistics - Lecture Notes in Computer Science ◽

10.1007/978-3-030-66527-2_2 ◽

2020 ◽

pp. 17-30

Author(s):

Mika Koistinen ◽

Kimmo Kettunen ◽

Jukka Kervinen

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Performance Analysis of Open Source Optical Character Recognition

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9060 ◽

2020 ◽

Vol 17 (9) ◽

pp. 4267-4275

Author(s):

Jagadish Kallimani ◽

Chandrika Prasad ◽

D. Keerthana ◽

Manoj J. Shet ◽

Prasada Hegde ◽

...

Keyword(s):

Performance Analysis ◽

Comparative Study ◽

Open Source ◽

Error Rate ◽

Character Recognition ◽

Optical Character Recognition ◽

Accurate Result ◽

Recognition System ◽

Optical Character ◽

Recognition Systems

Optical character recognition is the electronic or mechanical conversion of images of text into machine-encoded text. The text in the image can be handwritten, typed, or printed. Examples of image sources include a photograph of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give 100% accurate results. This project analyzes the error rate of several open source optical character recognition systems (Boxoft OCR, ABBYY, Tesseract, Free Online OCR, etc.) on a set of diverse documents and presents a comparative study of them. From this, we can determine which OCR system is best suited to a given document.
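
The abstract does not say how the error rate was computed. A common choice is the character error rate (CER): the Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The sketch below assumes that definition; the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("recognition", "recognitlon")` counts one substitution over eleven reference characters. Computing this value per document and per engine gives the kind of table a comparative study needs.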

Open source optical character recognition for historical research

Journal of Documentation ◽

10.1108/00220411211256021 ◽

2012 ◽

Vol 68 (5) ◽

pp. 659-683 ◽

Cited By ~ 3

Author(s):

Tobias Blanke ◽

Michael Bryant ◽

Mark Hedges

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Historical Research ◽

Optical Character

Towards Syriac Digital Corpora: Evaluation of Tesseract 4.0 for Syriac OCR

Hugoye: Journal of Syriac Studies ◽

10.31826/hug-2019-220105 ◽

2019 ◽

Vol 22 (1) ◽

pp. 109-192

Author(s):

Emily Chesley ◽

Jillian Marcantonio ◽

Abigail Pearson

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Further Training ◽

Current State ◽

Optical Character ◽

Extensive Test ◽

Degree Of Confidence ◽

High Degree

This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.
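
The abstract reports consonantal accuracy without defining the calculation. One plausible reading, assumed here, is that vowel points and other diacritics are ignored and only base letters are compared. In Unicode, Syriac vowel points are combining marks, so the sketch below strips every combining mark before aligning the two strings with Python's `difflib`. The function names and the matched-characters-over-reference formula are assumptions, not the paper's method.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_combining_marks(text):
    """Decompose the text and drop nonspacing marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def base_character_accuracy(reference, ocr_output):
    """Share of reference base characters that are matched in the OCR output."""
    ref = strip_combining_marks(reference)
    hyp = strip_combining_marks(ocr_output)
    if not ref:
        raise ValueError("reference has no base characters")
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)
```

With this definition, an OCR result that misses every vowel point but gets every base letter right still scores 1.0, which is why a consonant-only figure can sit well above a full character accuracy.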

Machine Replication of Human Perusing using Optical Character Recognition with Tesseract

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1079.1292s419 ◽

2019 ◽

Vol 9 (2S4) ◽

pp. 74-77

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Optical Character Recognition is the machine replication of human perusing (reading). It is the electronic conversion of scanned images, where the image may contain typewritten or printed text. It is implemented using Google's open source Optical Character Recognition software, Tesseract. The OCR system takes an image as input, extracts the text from that image, and then converts it into any other language the user wants. The system can be useful in applications such as banking, the legal industry, travel and other industries, and home and office automation. It is mainly intended for people who are unable to read text documents and to reduce the burden of data-entry jobs.[4]

Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study

International Journal of Computer Applications ◽

10.5120/8794-2784 ◽

2012 ◽

Vol 55 (10) ◽

pp. 50-56 ◽

Cited By ~ 69

Author(s):

Chirag Patel ◽

Atul Patel ◽

Dharmendra Patel

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Polarity Identification for Handwritten Text in Multilingual Documents Using Open Source Optical Character Recognition Tools

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9017 ◽

2020 ◽

Vol 17 (9) ◽

pp. 4045-4049

Author(s):

C. P. Chandrika ◽

Jagadish S. Kallimani

Keyword(s):

Web Services ◽

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Customer Reviews ◽

Optical Character ◽

Handwritten Text ◽

Amazon Web Services ◽

Rest Api

Sentiment analysis is a prerequisite for many applications. We propose a model in which handwritten text in English and Kannada is scanned with CamScanner and then converted into editable text using various open source optical character recognition tools. The performance of the different OCR tools is analyzed and tabulated. Sentiment analysis is performed on statements written in both English and Kannada using WordNet, the Algorithmia REST API, and local dictionaries, with satisfactory results. The same sentiment analysis module is also applied to customer reviews of a mobile product, with the reviews taken from Amazon Web Services. The customer's opinion of the product can be identified correctly.

An Accuracy Examination of OCR Tools

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1102.0789s419 ◽

2019 ◽

Vol 8 (9S4) ◽

pp. 5-9

Keyword(s):

Comparative Analysis ◽

Comparative Study ◽

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Research Paper ◽

Text Segmentation ◽

Digital Form ◽

Optical Character ◽

Processing Algorithms

In this research paper, the authors present a comparative study of optical character recognition using different open source OCR tools. Optical character recognition (OCR) is used to extract text from images. Its applications include extracting text from any document or image, or simply reading and processing text that is already available in digital form. The accuracy of OCR can depend on text segmentation and pre-processing algorithms. It is sometimes difficult to retrieve text from an image because of variations in size, style, and orientation, a complex image background, and so on. The authors tried to extract vehicle numbers from vehicle number plates using various OCR tools such as Tesseract, GOCR, Ocrad, and TensorFlow. They try to identify the best available method for optical character recognition and provide a comparative analysis of the tools' accuracy.
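
For number plates, a partly correct reading is usually of little use, so one reasonable way to compare tools is exact-match accuracy after light normalization. The sketch below assumes that metric and assumes plates contain only Latin letters and digits; the abstract does not state the scoring used, and the plate strings in the usage note are made up for illustration.

```python
import re


def normalize_plate(text):
    """Uppercase the reading and drop everything except A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def exact_match_accuracy(ground_truth, predictions):
    """Fraction of plates whose normalized reading equals the ground truth."""
    if not ground_truth or len(ground_truth) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        normalize_plate(truth) == normalize_plate(predicted)
        for truth, predicted in zip(ground_truth, predictions)
    )
    return hits / len(ground_truth)
```

Given the ground truth `["KA01AB1234", "MH12DE1433"]` and one tool's output `["KA 01 AB 1234", "MH12DE1438"]`, the first plate matches after normalization and the second does not. Repeating the call for each tool's predictions gives a directly comparable score per tool.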

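An Android-Based English-Indonesian Translator Application with Optical Character Recognition (original title: Aplikasi Penerjemah Bahasa Inggris – Indonesia dengan Optical Character Recognition Berbasis Android)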

Jurnal Teknologi dan Sistem Komputer ◽

10.14710/jtsiskom.4.1.2016.167-177 ◽

2016 ◽

Vol 4 (1) ◽

pp. 167 ◽

Cited By ~ 1

Author(s):

Anisa Eka Utami ◽

Oky Dwi Nurhayati ◽

Kurniawan Teguh Martono

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character

Character recognition software for smartphones, particularly Android-based ones, is developed with an emphasis on mobility and portability so that limitations of storage space, hardware, and range can be overcome. However, the performance of an Android smartphone differs from that of a computer, which also affects the speed of character recognition. This problem points to a solution, namely an innovation applied to Android devices using OCR (Optical Character Recognition) technology. The system design follows reuse-oriented software development because it relies on reusable components. The system is built on the open source Tesseract OCR engine developed by Google. The layout design and system implementation use the Android Studio development environment, with code written in Java and XML. The OCR translator application was tested using the white-box method and by measuring character detection accuracy. The character detection accuracy of the application across all tested samples reached 97.5%.

Adoption of an Open Source Optical Character Recognition (OCR) for Database Buildup of the Students’ Scholastic Records

International Journal of Information and Electronics Engineering ◽

10.18178/ijiee.2016.6.3.625 ◽

2016 ◽

Vol 6 (3) ◽

pp. 206-209 ◽

Cited By ~ 1

Author(s):

Milleth M. Bautista ◽

Benilda Eleonor V. Comendador

Keyword(s):

Open Source ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character
