A Novel Noise-removal Technique for Document Images

For developing a high quality Optical Character Recognition (OCR) system removal of noise from the document image is an utmost important step. To make this possible, filtering plays a significant role. Although mean and median filters, the two well-known statistical filtering techniques, are used commonly but sometimes these filters may fail to produce noise-free images or sometimes may introduce distortions on the characters in the form of gulfs or capes. In the work reported here, we have developed a new filtering technique, called Middle of Modal Class (MMC), for smoothing the input images. This filtering technique is applicable for both the noisy and noise free text document image at the same time. We have also compared our results with mean and median filters, and have achieved better result.

Download Full-text

Skew detection based on vertical projection in latin character recognition of text document image

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.44.26983 ◽

2018 ◽

Vol 7 (4.44) ◽

pp. 198

Author(s):

Ronny Susanto ◽

Farica P. Putri ◽

Y. Widya Wiratama

Keyword(s):

Error Rate ◽

Character Recognition ◽

Optical Character Recognition ◽

Rotation Angle ◽

Document Image ◽

Computational Time ◽

Vertical Projection ◽

Word Error Rate ◽

Skew Detection ◽

Text Document

The accuracy of Optical Character Recognition is deeply affected by the skew of the image. Skew detection & correction is one of the steps in OCR preprocessing to detect and correct the skew of document image. This research measures the effect of Combined Vertical Projection skew detection method to the accuracy of OCR. Accuracy of OCR is measured in Character Error Rate, Word Error Rate, and Word Error Rate (Order Independent). This research also measures the computational time needed in Combined Vertical Projection with different iteration. The experiment of Combined Vertical Projection is conducted by using iteration 0.5, 1, and 2 with rotation angle within -10 until 10 degrees. The experiment results show that the use of Combined Vertical Projection could lower the Character Error Rate, Word Error Rate, and Word Error Rate (Order Independent) up to 35.53, 34.51, and 32.74 percent, respectively. Using higher iteration value could lower the computational time but also decrease the accuracy of OCR.

Download Full-text

Development of the documents comparison module for an electronic document management system

Information Technology and Nanotechnology ◽

10.18287/1613-0073-2019-2416-527-533 ◽

2019 ◽

pp. 527-533

Author(s):

M A Mikheev ◽

P Y Yakimov

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Document Management ◽

Electronic Document ◽

Text Documents ◽

Text Document ◽

Document Management System ◽

Optical Character ◽

Electronic Document Management ◽

Scanned Image

The article is devoted to solving the problem of document versions comparison in electronic document management systems. Systems-analogues were considered, the process of comparing text documents was studied. In order to recognize the text on the scanned image, the technology of optical character recognition and its implementation — Tesseract library were chosen. The Myers algorithm is applied to compare received texts. The software implementation of the text document comparison module was implemented using the solutions described above.

Download Full-text

Optical Character Recognition of Indian Language Manuscripts using Convolutional Neural Networks

Design Engineering ◽

10.17762/de.v2021i3.7789 ◽

2021 ◽

pp. 894-911

Author(s):

Bhavesh Kataria, Dr. Harikrishna B. Jethva

Keyword(s):

Neural Networks ◽

Character Recognition ◽

Optical Character Recognition ◽

Short Term Memory ◽

Short Term ◽

Indian Language ◽

Term Memory ◽

Text Document ◽

Optical Character ◽

Long Short Term Memory

India's constitution has 22 languages written in 17 different scripts. These materials have a limited lifespan, and as generations pass, these materials deteriorate, and the vital knowledge is lost. This work uses digital texts to convey information to future generations. Optical Character Recognition (OCR) helps extract information from scanned manuscripts (printed text). This paper proposes a simple and effective solution of optical character recognition (OCR) Sanskrit Character from text document images using long short-term memory (LSTM) and neural networks of Sanskrit Characters. Existing methods focuses only upon the single touching characters. But our main focus is to design a robust method using Bidirectional Long Short-Term Memory (BLSTM) architecture for overlapping lines, touching characters in middle and upper zone and half character which would increase the accuracy of the present OCR system for recognition of poorly maintained Sanskrit literature.

Download Full-text

Word-Level Script Identification Using Texture Based Features

International Journal of System Dynamics Applications ◽

10.4018/ijsda.2015040105 ◽

2015 ◽

Vol 4 (2) ◽

pp. 74-94

Author(s):

Pawan Kumar Singh ◽

Ram Sarkar ◽

Mita Nasipuri

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Statistical Significance ◽

Document Image ◽

Statistical Significance Testing ◽

Script Identification ◽

Word Level ◽

Histograms Of Oriented Gradients ◽

Handwritten Text ◽

Identification Technique

Script identification is an appealing research interest in the field of document image analysis during the last few decades. The accurate recognition of the script is paramount to many post-processing steps such as automated document sorting, machine translation and searching of text written in a particular script in multilingual environment. For automatic processing of such documents through Optical Character Recognition (OCR) software, it is necessary to identify different script words of the documents before feeding them to the OCR of individual scripts. In this paper, a robust word-level handwritten script identification technique has been proposed using texture based features to identify the words written in any of the seven popular scripts namely, Bangla, Devanagari, Gurumukhi, Malayalam, Oriya, Telugu, and Roman. The texture based features comprise of a combination of Histograms of Oriented Gradients (HOG) and Moment invariants. The technique has been tested on 7000 handwritten text words in which each script contributes 1000 words. Based on the identification accuracies and statistical significance testing of seven well-known classifiers, Multi-Layer Perceptron (MLP) has been chosen as the final classifier which is then tested comprehensively using different folds and with different epoch sizes. The overall accuracy of the system is found to be 94.7% using 5-fold cross validation scheme, which is quite impressive considering the complexities and shape variations of the said scripts. This is an extended version of the paper described in (Singh et al., 2014).

Download Full-text

Hindi Text Document Classification System Using SVM and Fuzzy

International Journal of Rough Sets and Data Analysis ◽

10.4018/ijrsda.2018100101 ◽

2018 ◽

Vol 5 (4) ◽

pp. 1-31 ◽

Cited By ~ 8

Author(s):

Shalini Puri ◽

Satya Prakash Singh

Keyword(s):

Classification System ◽

Character Recognition ◽

Optical Character Recognition ◽

Document Classification ◽

Data Availability ◽

Support Vector ◽

Handwritten Documents ◽

Text Document ◽

Survey Report ◽

Text Document Classification

In recent years, many information retrieval, character recognition, and feature extraction methodologies in Devanagari and especially in Hindi have been proposed for different domain areas. Due to enormous scanned data availability and to provide an advanced improvement of existing Hindi automated systems beyond optical character recognition, a new idea of Hindi printed and handwritten document classification system using support vector machine and fuzzy logic is introduced. This first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, this article depicts a feasibility study of such systems with the relevance of Hindi, a survey report of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. The technical reviews are provided and graphically represented to compare many parameters and estimate contents, forms and classifiers used in various existing techniques.

Download Full-text

Handwritten Kannada Document Image Processing using Optical Character Recognition

IOSR Journal of Computer Engineering ◽

10.9790/0661-1804063947 ◽

2016 ◽

Vol 18 (04) ◽

pp. 39-47

Author(s):

Mayur M Patil ◽

Akkamahadevi R Hanni

Keyword(s):

Image Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Document Image ◽

Document Image Processing ◽

Optical Character

Download Full-text

Corpus-based technique for improving Arabic OCR system

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v21.i1.pp233-241 ◽

2021 ◽

Vol 21 (1) ◽

pp. 233

Author(s):

Ahmed Hussain Aliwy ◽

Basheer Al-Sadawi

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Language Model ◽

Arabic Language ◽

Document Images ◽

Statistical Language Model ◽

Text Document ◽

Optical Character ◽

Arabic Ocr

<p><span>An optical character recognition (OCR) refers to a process of converting the text document images into editable and searchable text. OCR process poses several challenges in particular in the Arabic language due to it has caused a high percentage of errors. In this paper, a method, to improve the outputs of the Arabic Optical character recognition (AOCR) Systems is suggested based on a statistical language model built from the available huge corpora. This method includes detecting and correcting non-word and real words error according to the context of the word in the sentence. The results show that the percentage of improvement in the results is up to (98%) as a new accuracy for AOCR output. </span></p>

Download Full-text

A Novel Document Image Binarization For Optical Character Recognition

International Journal of Computer Applications Technology and Research ◽

10.7753/ijcatr0309.1006 ◽

2014 ◽

Vol 3 (9) ◽

pp. 559-563

Author(s):

Varada V M Abhinay ◽

P. Suresh Babu

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Document Image ◽

Image Binarization ◽

Optical Character ◽

Document Image Binarization

Download Full-text

SEDIQA: Sound Emitting Document Image Quality Assessment in a Reading Aid for the Visually Impaired

10.20944/preprints202107.0200.v1 ◽

2021 ◽

Author(s):

Jane Courtney

Keyword(s):

Image Quality ◽

Quality Assessment ◽

Character Recognition ◽

Optical Character Recognition ◽

Visually Impaired ◽

Image Quality Assessment ◽

Document Image ◽

Read Aloud ◽

Reference Image ◽

Reading Aids

For Visually impaired People (VIPs), the ability to convert text to sound can mean a new level of independence or the simple joy of a good book. With significant advances in Optical Character Recognition (OCR) in recent years, a number of reading aids are appearing on the market. These reading aids convert images captured by a camera to text which can then be read aloud. However, all of these reading aids suffer from a key issue – the user must be able to visually target the text and capture an image of sufficient quality for the OCR algorithm to function – no small task for VIPs. In this work, a Sound-Emitting Document Image Quality Assessment metric (SEDIQA) is proposed which allows the user to hear the quality of the text image and automatically captures the best image for OCR accuracy. This work also includes testing of OCR performance against image degradations, to identify the most significant contributors to accuracy reduction. The proposed No-Reference Image Quality Assessor (NR-IQA) is validated alongside established NR-IQAs and this work includes insights into the performance of these NR-IQAs on document images.

Download Full-text

Exposure Bracketing Techniques for Camera Document Image Enhancement

Applied Sciences ◽

10.3390/app9214529 ◽

2019 ◽

Vol 9 (21) ◽

pp. 4529

Author(s):

Tao Liu ◽

Hao Liu ◽

Yingying Wu ◽

Bo Yin ◽

Zhiqiang Wei

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Document Image ◽

Automatic Registration ◽

Digital Cameras ◽

Document Images ◽

Registration Method ◽

Lighting Conditions ◽

High Quality Image ◽

Multiple Document

Capturing document images using digital cameras in uneven lighting conditions is challenging, leading to poorly captured images, which hinders the processing that follows, such as Optical Character Recognition (OCR). In this paper, we propose the use of exposure bracketing techniques to solve this problem. Instead of capturing one image, we used several images that were captured with different exposure settings and used the exposure bracketing technique to generate a high-quality image that incorporates useful information from each image. We found that this technique can enhance image quality and provides an effective way of improving OCR accuracy. Our contributions in this paper are two-fold: (1) a preprocessing chain that uses exposure bracketing techniques for document images is discussed, and an automatic registration method is proposed to find the geometric disparity between multiple document images, which lays the foundation for exposure bracketing; (2) several representative exposure bracketing algorithms are incorporated in the processing chain and their performances are evaluated and compared.

Download Full-text