iDocChip: A Configurable Hardware Accelerator for an End-to-End Historical Document Image Processing

2021, Vol 7 (9), pp. 175
Author(s):  
Menbere Kina Tekleyohannes ◽  
Vladimir Rybalkin ◽  
Muhammad Mohsin Ghaffar ◽  
Javier Alejandro Varela ◽  
Norbert Wehn ◽  
...  

In recent years, there has been an increasing demand to digitize and electronically access historical records. Optical character recognition (OCR) is typically applied to scanned historical archives to transcribe them from document images into machine-readable texts. Many libraries offer special stationary equipment for scanning historical documents. However, to digitize these records without removing them from where they are archived, portable devices that combine scanning and OCR capabilities are required. An existing end-to-end OCR software system called anyOCR achieves high recognition accuracy for historical documents. However, it is unsuitable for portable devices, as it exhibits high computational complexity resulting in long runtime and high power consumption. Therefore, we have designed and implemented a configurable hardware-software programmable SoC called iDocChip that makes use of anyOCR techniques to achieve high accuracy. As a low-power and energy-efficient system with real-time capabilities, the iDocChip delivers the required portability. In this paper, we present the hybrid CPU-FPGA architecture of iDocChip along with optimized software implementations of the anyOCR pipeline. We demonstrate our results on multiple platforms with respect to runtime and power consumption. The iDocChip system outperforms the existing anyOCR software in runtime by 44× while achieving 2201× higher energy efficiency and a 3.8% increase in recognition accuracy.
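Since energy is power multiplied by runtime, the reported 44× speedup and 2201× energy-efficiency gain together imply roughly a 50× reduction in power draw. The snippet below is only a back-of-the-envelope check of that relationship using the figures quoted above; it is not a measurement from the paper.

```python
# Back-of-the-envelope check using the figures reported above.
# energy = power * runtime, so energy-efficiency gain = speedup * power reduction.
speedup = 44.0              # reported runtime improvement over anyOCR
energy_efficiency = 2201.0  # reported energy-efficiency improvement

implied_power_reduction = energy_efficiency / speedup
print(f"Implied power reduction: ~{implied_power_reduction:.0f}x")  # ~50x
```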

Author(s):  
Menbere Kina Tekleyohannes ◽  
Vladimir Rybalkin ◽  
Muhammad Mohsin Ghaffar ◽  
Javier Alejandro Varela ◽  
Norbert Wehn ◽  
...  

In recent years, optical character recognition (OCR) systems have been used to digitally preserve historical archives. To transcribe historical archives into a machine-readable form, the documents are first scanned and then an OCR is applied. In order to digitize documents without the need to remove them from where they are archived, it is valuable to have a portable device that combines scanning and OCR capabilities. Nowadays, there exist many commercial and open-source document digitization techniques, which are optimized for contemporary documents. However, they fail to give sufficient text recognition accuracy for transcribing historical documents due to the severe quality degradation of such documents. On the contrary, the anyOCR system, which is designed mainly to digitize historical documents, provides high accuracy. However, this comes at the cost of high computational complexity, resulting in long runtime and high power consumption. To tackle these challenges, we propose a low-power, energy-efficient accelerator with real-time capabilities called iDocChip, a configurable hybrid hardware-software programmable System-on-Chip (SoC) based on anyOCR for digitizing historical documents. In this paper, we focus on one of the most crucial processing steps in the anyOCR system: text and image segmentation, which makes use of a multi-resolution morphology-based algorithm. Moreover, an optimized FPGA-based hybrid architecture of this anyOCR step, along with its optimized software implementations, is presented. We demonstrate our results on multiple embedded and general-purpose platforms with respect to runtime and power consumption. The resulting hardware accelerator outperforms the existing anyOCR by 6.2×, while achieving 207× higher energy efficiency and maintaining its high accuracy.
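As a rough illustration of the kind of multi-resolution morphology-based text/image segmentation described above, the following sketch binarizes a page, applies morphological closing and opening at reduced resolution so that halftone/image regions merge into large blobs, and upscales the resulting mask. It is a simplified, single-stage approximation and not the anyOCR or iDocChip implementation; OpenCV, the scale factor, and the kernel size are all illustrative assumptions.

```python
import cv2

def separate_text_and_images(page_gray, scale=4, kernel_size=15):
    """Rough text/image separation via morphology at reduced resolution.

    Simplified illustration only; parameters are assumptions, not anyOCR's.
    """
    # Binarize so that ink and halftone dots become white foreground (255).
    _, binary = cv2.threshold(page_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Work at reduced resolution so large image/halftone regions merge cheaply.
    small = cv2.resize(binary, None, fx=1.0 / scale, fy=1.0 / scale,
                       interpolation=cv2.INTER_AREA)

    # Closing fuses dense halftone dots into solid blobs; the following opening
    # removes thin isolated text strokes, leaving only the image regions.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    blobs = cv2.morphologyEx(small, cv2.MORPH_CLOSE, kernel)
    blobs = cv2.morphologyEx(blobs, cv2.MORPH_OPEN, kernel)

    # Upscale the blob mask to full resolution: white marks image regions.
    image_mask = cv2.resize(blobs, (page_gray.shape[1], page_gray.shape[0]),
                            interpolation=cv2.INTER_NEAREST)
    text_mask = cv2.bitwise_and(binary, cv2.bitwise_not(image_mask))
    return text_mask, image_mask
```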


Author(s):  
Elvys Linhares Pontes ◽  
Luis Adrián Cabrera-Diego ◽  
Jose G. Moreno ◽  
Emanuela Boros ◽  
Ahmed Hamdi ◽  
...  

Digital libraries have a key role in cultural heritage as they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances, most NLP models are built for specific languages and contemporary documents and are not optimized for handling historical material, which may, for instance, contain language variations and optical character recognition (OCR) errors. In this work, we focus on the entity linking (EL) task, which is fundamental to the indexation of documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles that is composed of multilingual analysis, OCR correction, and filter analysis to alleviate the impact of historical document noise on the EL task. The source code is publicly available. Experiments were conducted on two historical document datasets covering five European languages (English, Finnish, French, German, and Swedish). The results show that our system improved the overall performance for all languages and datasets, achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.


2015, Vol 4 (2), pp. 74-94
Author(s):  
Pawan Kumar Singh ◽  
Ram Sarkar ◽  
Mita Nasipuri

Script identification has been an appealing research topic in the field of document image analysis over the last few decades. Accurate recognition of the script is paramount to many post-processing steps such as automated document sorting, machine translation, and searching for text written in a particular script in a multilingual environment. For automatic processing of such documents through Optical Character Recognition (OCR) software, it is necessary to identify the scripts of the words in a document before feeding them to the OCR engines of the individual scripts. In this paper, a robust word-level handwritten script identification technique is proposed that uses texture-based features to identify words written in any of seven popular scripts, namely Bangla, Devanagari, Gurumukhi, Malayalam, Oriya, Telugu, and Roman. The texture-based features combine Histograms of Oriented Gradients (HOG) and moment invariants. The technique has been tested on 7000 handwritten text words, with each script contributing 1000 words. Based on the identification accuracies and statistical significance testing of seven well-known classifiers, the Multi-Layer Perceptron (MLP) was chosen as the final classifier and then tested comprehensively using different folds and epoch sizes. The overall accuracy of the system is 94.7% using a 5-fold cross-validation scheme, which is quite impressive considering the complexities and shape variations of the said scripts. This is an extended version of the paper described in (Singh et al., 2014).
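The feature pipeline described above (HOG plus moment invariants, classified by an MLP under 5-fold cross-validation) could be prototyped along the following lines. The paper does not specify an implementation; scikit-image, OpenCV, scikit-learn, the 128×64 word size, and the MLP hyperparameters are assumptions for illustration only.

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def word_features(word_img_gray):
    """HOG descriptor plus the seven Hu moment invariants for one word image."""
    resized = cv2.resize(word_img_gray, (128, 64))   # assumed normalized word size
    hog_vec = hog(resized, orientations=9,
                  pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    hu = cv2.HuMoments(cv2.moments(resized)).flatten()
    return np.concatenate([hog_vec, hu])

def evaluate(word_images, labels):
    """Mean 5-fold cross-validation accuracy of an MLP on the extracted features."""
    feats = np.array([word_features(img) for img in word_images])
    clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
    return cross_val_score(clf, feats, np.array(labels), cv=5).mean()
```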


2019, Vol 72 (2), pp. 179-197
Author(s):  
Omri Suissa ◽  
Avshalom Elmalech ◽  
Maayan Zhitomirsky-Geffet

Purpose: Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach for digitization is to scan the documents into images and then convert the images into text using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which crowdsourcing methodology is the most effective in different scenarios and for various research objectives. Design/methodology/approach: A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised. Findings: The analysis suggests that in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal structure of the experiment is two-phase with a scanned image. In terms of efficiency, the best results were obtained when using longer texts in the single-stage structure with no image. Practical implications: The study provides practical recommendations to researchers on how to build the optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to create gold-standard historical texts for automatic OCR post-correction. Originality/value: This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.
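The paper devises its own accuracy and efficiency measures; purely as an illustration of how correction quality can be quantified, the sketch below scores a worker's output by the fraction of OCR character errors (Levenshtein distance to the ground truth) that the correction removes. The function names and the metric itself are hypothetical, not the authors' measures.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def error_reduction(ocr_text, corrected_text, ground_truth):
    """Fraction of the OCR's character errors removed by a worker's correction."""
    before = edit_distance(ocr_text, ground_truth)
    after = edit_distance(corrected_text, ground_truth)
    return 0.0 if before == 0 else (before - after) / before
```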


2020, Vol 6 (5), pp. 32
Author(s):  
Yekta Said Can ◽  
M. Erdem Kabadayı

Historical document analysis systems gain importance with the increasing efforts in the digitization of archives. Page segmentation and layout analysis are crucial steps for such systems. Errors in these steps affect the outcome of handwritten text recognition and Optical Character Recognition (OCR) methods, which increases the importance of page segmentation and layout analysis. Degradation of documents, digitization errors, and varying layout styles complicate the segmentation of historical documents. The properties of Arabic scripts, such as connected letters, ligatures, diacritics, and different writing styles, make it even more challenging to process Arabic-script historical documents. In this study, we developed an automatic system for counting registered individuals and assigning them to populated places by using a CNN-based architecture. To evaluate the performance of our system, we created a labeled dataset of registers obtained from the first wave of population registers of the Ottoman Empire, conducted between the 1840s and 1860s. We achieved promising results for classifying different types of objects, counting individuals, and assigning them to populated places.
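The counting step could, for example, reduce to labeling connected regions in a segmentation mask produced by the CNN. The helper below is a hypothetical sketch of that final counting stage only, not the authors' architecture; it assumes an upstream model has already marked each register entry as a foreground blob.

```python
import cv2

def count_register_entries(binary_mask):
    """Count distinct foreground blobs in an 8-bit binary segmentation mask.

    Hypothetical counting stage only: assumes an upstream CNN has already
    marked each individual's register entry as a white blob in the mask.
    """
    num_labels, _ = cv2.connectedComponents(binary_mask)
    return num_labels - 1  # label 0 is the background
```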


2021, Vol 9 (2), pp. 73-84
Author(s):  
Md. Shahadat Hossain ◽  
Md. Anwar Hossain ◽  
AFM Zainul Abadin ◽  
Md. Manik Ahmed

The recognition of handwritten Bangla digits has made significant progress in optical character recognition (OCR). It remains a critical task due to the similar patterns and alignment of handwritten digits. Although modern OCR research has reduced the complexity of the classification task through several methods, a few problems encountered during recognition are still waiting to be solved with simpler methods. Deep neural networks, an emerging field of artificial intelligence, promise a solid solution to these handwriting recognition problems. This paper proposes a fine-regulated deep neural network (FRDNN) for the handwritten numeric character recognition problem that uses convolutional neural network (CNN) models with regularization parameters, which makes the model generalize better by preventing overfitting. We applied Traditional Deep Neural Network (TDNN) and Fine-Regulated Deep Neural Network (FRDNN) models with similar layer configurations to the BanglaLekha-Isolated database; the classification accuracies for the two models were 96.25% and 96.99%, respectively, over 100 epochs. In our experiments, the FRDNN model was more robust and accurate than the TDNN model on the BanglaLekha-Isolated digit dataset. The proposed method obtains good recognition accuracy compared with other existing methods.
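As a hedged sketch of a CNN "with regularization parameters" in the spirit of the FRDNN description (not the authors' exact architecture), the Keras model below combines L2 weight decay and dropout; the 32×32 grayscale input size, layer widths, and coefficients are assumptions for illustration.

```python
from tensorflow.keras import layers, models, regularizers

def build_regularized_cnn(input_shape=(32, 32, 1), num_classes=10,
                          weight_decay=1e-4, drop_rate=0.3):
    """Small CNN whose regularization knobs are L2 weight decay and dropout."""
    l2 = regularizers.l2(weight_decay)
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu", kernel_regularizer=l2),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu", kernel_regularizer=l2),
        layers.MaxPooling2D(),
        layers.Dropout(drop_rate),
        layers.Flatten(),
        layers.Dense(128, activation="relu", kernel_regularizer=l2),
        layers.Dropout(drop_rate),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_regularized_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```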


In this paper, we introduce a new Pashtu numerals dataset of handwritten scanned images and make it publicly available for scientific and research use. The Pashtu language is used by more than fifty million people for both oral and written communication, yet no efforts have so far been devoted to an Optical Character Recognition (OCR) system for the Pashtu language. We introduce a new method for handwritten numeral recognition in Pashtu based on deep learning models. We use convolutional neural networks (CNNs) for both feature extraction and classification. We assess the performance of the proposed CNN-based model and obtain a recognition accuracy of 91.45%.

