How to Improve Optical Character Recognition of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine – Final Notes on Development and Evaluation

Author(s):  
Mika Koistinen ◽  
Kimmo Kettunen ◽  
Jukka Kervinen
2020 ◽  
Vol 17 (9) ◽  
pp. 4267-4275
Author(s):  
Jagadish Kallimani ◽  
Chandrika Prasad ◽  
D. Keerthana ◽  
Manoj J. Shet ◽  
Prasada Hegde ◽  
...  

Optical character recognition (OCR) is the electronic or mechanical conversion of images of text into machine-encoded text. The text in the image can be handwritten, typewritten, or printed. Typical image sources include a photograph of a document, a scanned document, or text superimposed on an image. Most optical character recognition systems do not give a 100% accurate result. This project analyzes the error rate of several open source optical character recognition systems (Boxoft OCR, ABBY, Tesseract, Free Online OCR, etc.) on a set of diverse documents and presents a comparative study of the results. From this, we can determine which OCR system is best suited to a given document.
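The abstract does not say how the error rate is measured. One common choice, assumed here, is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The function names below are illustrative and do not come from the paper.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # drop a character of ref
                curr[j - 1] + 1,     # add a character of hyp
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Because insertions count as errors, the CER can exceed 1.0 when an engine emits much more text than the reference contains.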


2012 ◽  
Vol 68 (5) ◽  
pp. 659-683 ◽  
Author(s):  
Tobias Blanke ◽  
Michael Bryant ◽  
Mark Hedges

2019 ◽  
Vol 22 (1) ◽  
pp. 109-192
Author(s):  
Emily Chesley ◽  
Jillian Marcantonio ◽  
Abigail Pearson

This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.


Optical Character Recognition is the machine replication of human reading: the electronic conversion of scanned images, where the image can contain typewritten or printed text. It is implemented using Google's open source Optical Character Recognition software, Tesseract. The OCR system accepts an image as input, extracts the text from that image, and then translates it into whichever other language the user wants. This system can be useful in various applications such as banking, the legal industry, travel and other industries, and home and office automation. It is mainly intended for people who are unable to read text documents and to reduce the burden of data entry jobs.[4]


2020 ◽  
Vol 17 (9) ◽  
pp. 4045-4049
Author(s):  
C. P. Chandrika ◽  
Jagadish S. Kallimani

Sentiment analysis is a prerequisite for many applications. We propose a model in which handwritten text in English and Kannada is scanned with CamScanner and then converted into editable text using various open source optical character recognition tools. The performance of the different OCR tools is analyzed and tabulated. Sentiment analysis is performed on the statements written in both English and Kannada using WordNet, the Algorithmia REST API, and local dictionaries, and we obtained satisfactory results. The same sentiment analysis module is also applied to customer reviews of a mobile product, with the reviews taken from Amazon Web Services. The customer's opinion about the product can be identified correctly.


In this research paper, the authors present a comparative study of optical character recognition using different open source OCR tools. Optical character recognition (OCR) is used to extract text from images. Its applications include extracting text from any document or image, or simply reading and processing text so that it is available in digital form. The accuracy of OCR can depend on the text segmentation and pre-processing algorithms. Text is sometimes difficult to retrieve from an image because of variations in size, style, and orientation, a complex image background, and so on. The authors tried to extract vehicle numbers from vehicle number plates using various OCR tools such as Tesseract, GOCR, Ocrad, and TensorFlow. They sought to identify the best available method for optical character recognition and provide a comparative analysis of the tools' accuracy.
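The abstract does not describe how accuracy was scored. As a minimal sketch, assume a plate counts as correct only when the whole recognised string matches the ground truth after normalisation. The helper names and the engine labels `engine_a` and `engine_b` are hypothetical placeholders, not results for any real tool.

```python
def normalize(plate: str) -> str:
    """Upper-case the plate and drop spaces, hyphens and other punctuation."""
    return "".join(ch for ch in plate.upper() if ch.isalnum())


def exact_match_accuracy(truth: list[str], predicted: list[str]) -> float:
    """Fraction of plates whose normalised prediction equals the ground truth."""
    if not truth or len(truth) != len(predicted):
        raise ValueError("truth and predicted must be non-empty and equal in length")
    hits = sum(normalize(t) == normalize(p) for t, p in zip(truth, predicted))
    return hits / len(truth)


def rank_engines(truth: list[str], outputs: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Score every engine's output and sort from most to least accurate."""
    scores = {name: exact_match_accuracy(truth, preds) for name, preds in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up plates and outputs, purely to show the call shape.
truth = ["KA01AB1234", "MH12CD5678"]
outputs = {
    "engine_a": ["KA 01 AB 1234", "MH12CD5678"],
    "engine_b": ["KA01A81234", "MH12CD5678"],
}
ranking = rank_engines(truth, outputs)
```

Exact matching is strict: a single misread character fails the whole plate, which suits number plates because a partly correct number is of little use.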


2016 ◽  
Vol 4 (1) ◽  
pp. 167 ◽  
Author(s):  
Anisa Eka Utami ◽  
Oky Dwi Nurhayati ◽  
Kurniawan Teguh Martono

Character recognition software for smartphones, particularly Android-based ones, is developed with an emphasis on overcoming limitations of mobility, portability, storage space, hardware, and range. However, because the performance of an Android smartphone differs from that of a computer, the speed of character recognition is also affected. One solution to this problem is to bring OCR (Optical Character Recognition) technology to Android devices. The system was designed using reuse-oriented software development, because it is built from reusable components. The system uses the Tesseract OCR engine, which is developed by Google and is open source. The layout and implementation of the system were created in the Android Studio development environment, using the Java programming language and XML. The translator application with OCR was tested using the white-box method and by calculating character detection accuracy. Across all tested samples, the application's character detection accuracy reached 97.5%.


2012 ◽  
Vol 55 (10) ◽  
pp. 50-56 ◽  
Author(s):  
Chirag Patel ◽  
Atul Patel ◽  
Dharmendra Patel

2021 ◽  
Author(s):  
Redwan Islam

Optical Character Recognition (OCR) is the process of extracting text from an image. The main purpose of OCR is to produce editable documents from existing paper documents or image files. OCR primarily works in two phases: character detection and word detection. In more sophisticated approaches, an OCR system also performs sentence detection to preserve the document's structure. In this paper, we discuss the process of developing an OCR system for the Bengali language. Much effort has been put into developing OCR for Bengali. Although some OCR systems have been developed, none of them is completely error-free. For our thesis, we trained the Tesseract OCR engine to develop an OCR system for Bengali. Tesseract is currently the most accurate OCR engine. It was developed at HP Labs and is currently sponsored by Google. Tesseract offers two training options: legacy training and LSTM training. We carried out both.

