Improving the Accuracy of Tesseract 4.0 OCR Engine Using Convolution-Based Preprocessing

Symmetry ◽  
2020 ◽  
Vol 12 (5) ◽  
pp. 715
Author(s):  
Dan Sporici ◽  
Elena Cușnir ◽  
Costin-Anton Boiangiu

Optical Character Recognition (OCR) is the process of identifying text rendered as pixels in images and converting it to a more computer-friendly representation. The presented work aims to show that the accuracy of the Tesseract 4.0 OCR engine can be further enhanced by employing convolution-based preprocessing with specific kernels. While Tesseract 4.0 has shown strong performance when evaluated on favorable input, its capability to properly detect and identify characters in more realistic, unfriendly images is questionable. The article proposes an adaptive image preprocessing step guided by a reinforcement learning model, which attempts to minimize the edit distance between the recognized text and the ground truth. This approach is shown to boost the character-level accuracy of Tesseract 4.0 from 0.134 to 0.616 (a +359% relative change) and the F1 score from 0.163 to 0.729 (a +347% relative change) on a dataset that its authors consider challenging.
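
To illustrate the general idea (though not the authors' trained reinforcement-learning policy), the sketch below convolves the input with a hypothetical sharpening kernel via OpenCV before running Tesseract through pytesseract, and scores both outputs against a ground-truth transcript with Levenshtein edit distance. The file names, the kernel, and the scoring setup are assumptions made for the example.

```python
# Minimal sketch: convolution-based preprocessing before Tesseract OCR.
# Assumptions: 'page.png' and its transcript 'page.txt' exist locally, and the
# sharpening kernel is one illustrative choice, not the learned policy from the paper.
import cv2
import numpy as np
import pytesseract
from Levenshtein import distance as edit_distance  # pip install python-Levenshtein

def ocr_with_kernel(image_path, kernel):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    filtered = cv2.filter2D(img, -1, kernel)      # convolve with the chosen kernel
    return pytesseract.image_to_string(filtered)

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float32)

ground_truth = open("page.txt", encoding="utf-8").read()
baseline = pytesseract.image_to_string(cv2.imread("page.png", cv2.IMREAD_GRAYSCALE))
enhanced = ocr_with_kernel("page.png", sharpen)

# A lower edit distance to the ground truth indicates better recognition.
print("baseline distance:", edit_distance(baseline, ground_truth))
print("filtered distance:", edit_distance(enhanced, ground_truth))
```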

Theoretical—This paper presents a camera-based assistive text reading framework to help visually impaired people read product labels on hand-held objects. The camera serves as the primary source of input. To detect an object, the user moves it in front of the camera, and the moving object is detected by a Background Subtraction (BGS) method. The text region is then automatically localized as the Region of Interest (ROI). Text is extracted from the ROI by combining a rule-based and a learning-based technique. A novel rule-based text localization algorithm detects geometric features such as pixel value, color intensity, and character size, while features such as gradient magnitude, gradient width, and stroke width are learned using an SVM classifier, and a model is built to separate text from non-text regions. This framework is integrated with OCR (Optical Character Recognition) to extract the text, and the extracted text is delivered as voice output to the user. The system is evaluated on the ICDAR-2011 dataset, which consists of 509 natural scene images with ground truth.
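
A heavily simplified sketch of such a pipeline is given below, assuming a webcam, OpenCV's MOG2 background subtractor as the BGS step, pytesseract for OCR, and pyttsx3 for the voice output; the SVM-based text/non-text classification described in the paper is omitted, and the largest moving region is simply treated as the ROI.

```python
# Minimal sketch of a camera-based assistive reading loop (not the paper's full system):
# background subtraction -> largest moving region as a crude ROI -> OCR -> speech output.
import cv2
import pytesseract
import pyttsx3

cap = cv2.VideoCapture(0)                        # default camera as the input source
bgs = cv2.createBackgroundSubtractorMOG2()       # Background Subtraction (BGS) step
tts = pyttsx3.init()

ret, frame = cap.read()
while ret:
    mask = bgs.apply(frame)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        roi = frame[y:y + h, x:x + w]            # the paper localizes text regions inside this
        text = pytesseract.image_to_string(roi).strip()
        if text:
            tts.say(text)                        # voice output of the extracted text
            tts.runAndWait()
    ret, frame = cap.read()
cap.release()
```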


Author(s):  
FANG-HSUAN CHENG ◽  
WEN-HSING HSU

This paper describes typical research on Chinese optical character recognition in Taiwan. Chinese characters can be represented by a set of basic line segments called strokes. Several approaches to the recognition of handwritten Chinese characters by stroke analysis are described here. A typical optical character recognition (OCR) system consists of four main parts: image preprocessing, feature extraction, radical extraction and matching. Image preprocessing is used to provide the suitable format for data processing. Feature extraction is used to extract stable features from the Chinese character. Radical extraction is used to decompose the Chinese character into radicals. Finally, matching is used to recognize the Chinese character. The reasons for using strokes as the features for Chinese character recognition are the following. First, all Chinese characters can be represented by a combination of strokes. Second, the algorithms developed under the concept of strokes do not have to be modified when the number of characters increases. Therefore, the algorithms described in this paper are suitable for recognizing large sets of Chinese characters.
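
To make the four-stage structure concrete, the skeleton below sketches how such a stroke-based pipeline could be organized; every function body is a placeholder, since the surveyed approaches implement each stage differently.

```python
# Skeleton of the four-stage stroke-based Chinese OCR pipeline described above.
# All stages are placeholders: the surveyed systems differ in how each is realized.

def preprocess(image):
    """Binarize, denoise, and normalize the character image."""
    ...

def extract_strokes(image):
    """Extract stable stroke features (basic line segments) from the character."""
    ...

def extract_radicals(strokes):
    """Group strokes into radicals, the reusable sub-components of characters."""
    ...

def match(radicals, reference_db):
    """Match the stroke/radical description against a reference dictionary."""
    ...

def recognize(image, reference_db):
    image = preprocess(image)
    strokes = extract_strokes(image)
    radicals = extract_radicals(strokes)
    return match(radicals, reference_db)
```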


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 174437-174448
Author(s):  
Jaewoo Park ◽  
Eunji Lee ◽  
Yoonsik Kim ◽  
Isaac Kang ◽  
Hyung Il Koo ◽  
...  

Optical Character Recognition or Optical Character Reader (OCR) is a pattern-recognition method for the electronic conversion of images of handwritten or printed text into machine-encoded text. The equipment used for this purpose includes cameras and flatbed scanners. Handwritten text is scanned using a scanner, and the image of the scanned document is processed by software. Recognizing handwritten text is difficult compared to other Western-language texts. In the proposed work, we take up the challenge of identifying these letters and work to achieve the same. Image preprocessing techniques can effectively improve the accuracy of an OCR engine. The goal is to design and implement, with machine learning and Python, a system that works more accurately than pre-built OCR engines developed with technologies such as MATLAB, artificial intelligence, neural networks, etc.
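
In the spirit of the abstract (building a recognizer with machine learning and Python rather than relying on a pre-built engine), the sketch below trains a small SVM on scikit-learn's bundled handwritten-digit images; the dataset and classifier choice are illustrative assumptions only.

```python
# Minimal sketch of a "build your own" character recognizer in Python:
# train a small classifier instead of calling a pre-built OCR engine.
# Uses scikit-learn's bundled 8x8 handwritten-digit images purely for illustration.
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()                           # 8x8 grayscale digit images
X = digits.images.reshape(len(digits.images), -1)         # flatten images to feature vectors
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001)                                # simple SVM character classifier
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```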


2021 ◽  
Vol 2 (2) ◽  
pp. 68
Author(s):  
Daniel Setiawan Cahyono ◽  
Shinta Estri Wahyuningrum

Optical Character Recognition (OCR) is a method by which a computer processes an image containing text, finds the characters in that image, and converts them to digital text. In this research, the Advanced Local Binary Pattern and Chain Code algorithms are tested for identifying alphabetic characters in the image. Several image preprocessing steps are also needed, such as image transformation, image rescaling, grayscale conversion, edge detection, and edge thinning.
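
A rough sketch of the preprocessing chain named above (rescaling, grayscale conversion, edge detection, and edge thinning) followed by a local binary pattern descriptor is shown below, using OpenCV and scikit-image; the chain code step is omitted, and the parameter values are assumptions rather than those used in the study.

```python
# Sketch of the listed preprocessing steps plus a local binary pattern (LBP) descriptor.
# Parameters (target size, Canny thresholds, LBP radius) are illustrative only.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.morphology import skeletonize

img = cv2.imread("character.png")
img = cv2.resize(img, (64, 64))                           # image rescaling
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)              # grayscale conversion
edges = cv2.Canny(gray, 100, 200)                         # edge detection
thin = skeletonize(edges > 0).astype(np.uint8)            # edge thinning

# LBP histogram as a simple texture/shape feature for the character image.
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)
print(hist)
```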


2019 ◽  
Vol 34 (4) ◽  
pp. 825-843 ◽  
Author(s):  
Mark J Hill ◽  
Simon Hengchen

Abstract This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
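
As a toy illustration of one such comparison (not the article's actual experiments), the snippet below extracts the strongest bigram collocations from an OCR transcription and from its keyed-in counterpart with NLTK and reports the overlap; the file names and frequency filter are placeholders.

```python
# Toy comparison of top collocations in an OCR text versus its keyed-in counterpart.
# File names are placeholders standing in for an ECCO page and its TCP transcription.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

def top_collocations(path, n=20):
    text = open(path, encoding="utf-8").read()
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(3)                           # ignore very rare bigrams
    return set(finder.nbest(BigramAssocMeasures().pmi, n))

ocr = top_collocations("document_ocr.txt")
keyed = top_collocations("document_keyed.txt")
print("shared collocations:", ocr & keyed)
print("only in the OCR version:", ocr - keyed)
```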


2020 ◽  
Vol 7 (2) ◽  
pp. 105-113
Author(s):  
Mayanda Mega Santoni ◽  
Nurul Chamidah ◽  
Desta Sandya Prasvita ◽  
Reza Amarta Prayoga ◽  
Bayu Permana Sukma

Tri Gatra Bangun Bahasa consists of prioritizing the Indonesian language, preserving regional languages, and mastering foreign languages. Accordingly, regional languages, as part of the wealth of the Indonesian nation, need to be preserved. In addition, regional languages also support the national language, Indonesian. Technology can be employed as an effort to preserve regional languages. This research uses artificial intelligence technology, namely a machine translation system that translates Indonesian into a regional language based on text images. The regional language used is Minang. The research focuses on translating the optical character recognition (OCR) output of Indonesian text images using edit distance algorithms, namely Hamming distance, Levenshtein distance, and Jaro-Winkler. The results show that edit distance algorithms can correct the OCR output for translation into the regional language. The OCR output on the text images has an initial accuracy of 50.72%. After applying the edit distance algorithms, translation accuracy increases to 68.34% with Hamming distance, 70.5% with Levenshtein distance, and 70.2% with Jaro-Winkler. Of the three algorithms, Levenshtein distance achieves the highest translation accuracy. Keywords: translation, Indonesian language, Minang language, Hamming distance, Levenshtein distance, Jaro-Winkler, optical character recognition
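
As a small illustration of the post-OCR correction idea (not the study's actual implementation), the snippet below maps each OCR token to the closest entry in a word list using Levenshtein distance; the toy lexicon and the distance threshold are invented for the example.

```python
# Minimal sketch of Levenshtein-based post-OCR correction: replace each OCR token
# with the closest dictionary word. The tiny lexicon and threshold are examples only.
from Levenshtein import distance as levenshtein  # pip install python-Levenshtein

DICTIONARY = ["bahasa", "indonesia", "daerah", "teknologi", "penerjemahan"]  # toy lexicon

def correct_token(token, max_dist=2):
    best = min(DICTIONARY, key=lambda w: levenshtein(token.lower(), w))
    return best if levenshtein(token.lower(), best) <= max_dist else token

ocr_output = "bahsa lndonesia dan bahasa daerh"
corrected = " ".join(correct_token(t) for t in ocr_output.split())
print(corrected)  # -> "bahasa indonesia dan bahasa daerah" ("dan" passes through unchanged)
```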


2018 ◽  
Vol 29 (1) ◽  
pp. 688-702 ◽  
Author(s):  
Suman Kumar Bera ◽  
Radib Kar ◽  
Souvik Saha ◽  
Akash Chakrabarty ◽  
Sagnik Lahiri ◽  
...  

Abstract Handwritten words can never match printed words because the former are mostly written in skewed or slanted form, or both. This very nature of handwriting adds a huge overhead when converting word images into machine-editable format through an optical character recognition system. Therefore, slope and slant corrections are considered as the fundamental pre-processing tasks in handwritten word recognition. For solving this, researchers have followed a two-pass approach where the slope of the word is corrected first and then slant correction is carried out subsequently, thus making the system computationally expensive. To address this issue, we propose a novel one-pass method, based on fitting an oblique ellipse over the word images, to estimate both the slope and slant angles of the same. Furthermore, we have developed three databases considering word images of three popular scripts used in India, namely Bangla, Devanagari, and Roman, along with ground truth information. The experimental results revealed the effectiveness of the proposed method over some state-of-the-art methods used for the aforementioned problem.
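
To convey the flavour of the approach (this is not the authors' exact one-pass formulation), the sketch below binarizes a word image, collects its ink pixels, and fits an ellipse with OpenCV whose orientation serves as a crude slope estimate; the file name and thresholding choices are assumptions.

```python
# Rough sketch: estimate a handwritten word's slope from an ellipse fitted to its ink pixels.
# Illustrates the general idea only, not the paper's oblique-ellipse method.
import cv2
import numpy as np

img = cv2.imread("word.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Collect foreground (ink) pixel coordinates as (x, y) points.
ys, xs = np.nonzero(binary)
points = np.column_stack((xs, ys)).astype(np.float32)

# cv2.fitEllipse needs at least five points and returns centre, axes, and orientation.
(cx, cy), axes, angle = cv2.fitEllipse(points)

# The ellipse orientation (degrees) gives a crude slope estimate; converting it to a
# signed skew angle depends on OpenCV's angle convention for the fitted ellipse.
print("ellipse orientation (degrees):", angle)
```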


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition, rather than being typed once more. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists that assess most nineteenth-century farms in Norway constitute one example among a series of valuable sources which can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in the recognition of material with a complex table structure, outlining a new algorithmic model based on ‘linked hierarchies’. Within the scope of this model, a variety of tables and layouts can be described and recognized. The ‘linked hierarchies’ model has been implemented in the ‘CRIPT’ OCR software system, which successfully reads tables with a complex structure from several different historical sources.
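
As a loose, invented illustration of what a ‘linked hierarchies’ description of a table might look like in code (the CRIPT system's actual data model is not reproduced here), one hierarchy below decomposes the table into rows and a linked one into columns, with each recognized cell reachable from both.

```python
# Invented illustration of a "linked hierarchies" style table description:
# a row hierarchy and a column hierarchy share the same cell objects.
# This is not the CRIPT system's actual data model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    text: str
    row_index: int
    col_index: int

@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)

@dataclass
class Column:
    cells: List[Cell] = field(default_factory=list)

@dataclass
class Table:
    rows: List[Row] = field(default_factory=list)
    columns: List[Column] = field(default_factory=list)

    def add_cell(self, cell: Cell) -> None:
        while len(self.rows) <= cell.row_index:
            self.rows.append(Row())
        while len(self.columns) <= cell.col_index:
            self.columns.append(Column())
        self.rows[cell.row_index].cells.append(cell)      # linked via the row hierarchy
        self.columns[cell.col_index].cells.append(cell)   # and via the column hierarchy

table = Table()
table.add_cell(Cell("Farm name", 0, 0))
table.add_cell(Cell("Assessed value", 0, 1))
print([c.text for c in table.columns[1].cells])           # -> ['Assessed value']
```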

