Machine Printed Text and Handwriting Identification in Noisy Document Images

Author(s):  
Yefeng Zheng ◽  
Huiping Li ◽  
David Doermann
Author(s):  
Ángel Sánchez ◽  
José F. Vélez ◽  
Javier Sánchez ◽  
A. Belén Moreno

Author(s):  
Gulfeshan Parween

Abstract: In this paper, we present a scheme for developing a complete OCR system for printed English uppercase letters in different fonts and sizes, so that the system can be used in the banking, corporate, and legal industries, among others. An OCR system consists of modules such as preprocessing, segmentation, feature extraction, and recognition. The preprocessing step includes image gray-level conversion, binary conversion, etc. After the features of the segmented characters are extracted, an artificial neural network can be used for character recognition. Efforts have been made to improve the performance of character recognition using artificial neural network techniques. The proposed OCR system accepts printed document images from a file and is implemented using MATLAB R2014a. Key words: OCR, Printed text, Barcode recognition
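The preprocessing steps this abstract names (gray-level conversion and binary conversion) can be sketched as follows; the function names, the luminosity weights, and the fixed global threshold of 128 are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an RGB image (H, W, 3) to gray level via luminosity weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def binarize(gray, threshold=128):
    """Global-threshold binarization: True = ink (dark), False = background."""
    return gray < threshold

# Tiny synthetic example: a white page with one dark pixel at (1, 2).
page = np.full((4, 4, 3), 255.0)
page[1, 2] = [0.0, 0.0, 0.0]
binary = binarize(to_grayscale(page))
```

In practice the fixed threshold would be replaced by an adaptive method (e.g. Otsu's), but the pipeline shape stays the same.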


Author(s):  
María José Castro-Bleda ◽  
Salvador España-Boquera ◽  
Francisco Zamora-Martínez

The field of off-line optical character recognition (OCR) has been a topic of intensive research for many years (Bozinovic, 1989; Bunke, 2003; Plamondon, 2000; Toselli, 2004). One of the first steps in the classical architecture of a text recognizer is preprocessing, where noise reduction and normalization take place. Many systems do not require a binarization step, so the images are maintained in gray-level quality. Document enhancement not only influences the overall performance of OCR systems, but it can also significantly improve document readability for human readers. In many cases, the noise of document images is heterogeneous, and a technique fitted for one type of noise may not be valid for the overall set of documents. One possible solution to this problem is to use several filters or techniques and to provide a classifier to select the appropriate one. Neural networks have been used for document enhancement (see (Egmont-Petersen, 2002) for a review of image processing with neural networks). One advantage of neural network filters for image enhancement and denoising is that a different neural filter can be automatically trained for each type of noise. This work proposes the clustering of neural network filters to avoid having to label training data and to reduce the number of filters needed by the enhancement system. An agglomerative hierarchical clustering algorithm of supervised classifiers is proposed to do this. The technique has been applied to filter out background noise from office documents (coffee stains and footprints on documents, folded sheets with degraded printed text, etc.).
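The idea of a trainable per-noise-type filter can be illustrated with the simplest possible case: a one-layer linear filter mapping a noisy patch to its clean centre pixel, fit by least squares. The patch size, the noise model, and the closed-form fit are illustrative assumptions, not the networks or the clustering procedure of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_filter(noisy_patches, clean_pixels):
    """Least-squares fit of a linear filter mapping a flattened noisy
    patch to its clean centre pixel (a one-layer 'neural' filter)."""
    w, *_ = np.linalg.lstsq(noisy_patches, clean_pixels, rcond=None)
    return w

# Synthetic data: each 3x3 patch is the clean centre value plus Gaussian noise.
clean = rng.uniform(0, 1, size=1000)
patches = np.repeat(clean[:, None], 9, axis=1) + rng.normal(0, 0.05, (1000, 9))
w = train_linear_filter(patches, clean)
denoised = patches @ w  # apply the trained filter
```

Training a separate such filter per noise type is what makes the clustering step of the paper attractive: it removes the need to hand-label which noise each training image contains.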


Author(s):  
Rajmohan Pardeshi ◽  
Mallikarjun Hangarge ◽  
Srikanth Doddamani ◽  
K.C. Santosh

Author(s):  
Konstantinos Zagoris ◽  
Ioannis Pratikakis ◽  
Apostolos Antonacopoulos ◽  
Basilis Gatos ◽  
Nikos Papamarkos

Author(s):  
HENRY S. BAIRD

A method for analyzing the structure of the white background in document images is described, along with applications to the problem of isolating blocks of machine-printed text. The approach is based on computational-geometry algorithms for off-line enumeration of maximal white rectangles and on-line rectangle unification. These support a fast, simple, and general heuristic for geometric layout segmentation, in which white space is covered greedily by rectangles until all text blocks are isolated. Design of the heuristic can be substantially automated by an analysis of the empirical statistical distribution of properties of covering rectangles: for example, the stopping rule can be chosen by Rosenblatt’s perceptron training algorithm. Experimental trials show good behavior on the large and useful class of textual Manhattan layouts. On complex layouts from English-language technical journals of many publishers, the method finds good segmentations in a uniform and nearly parameter-free manner. On a variety of non-Latin texts, some with vertical text lines, the method finds good segmentations without prior knowledge of page and text-line orientation.
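The core primitive of this approach, finding maximal ink-free rectangles, can be sketched with a brute-force search that is adequate for tiny binary images; the function name and the exhaustive enumeration are illustrative assumptions, not the computational-geometry algorithms the paper actually uses:

```python
import numpy as np

def largest_white_rect(img):
    """Brute-force search for the largest all-white (all-zero) rectangle
    in a small binary image; returns (row, col, height, width)."""
    best, h, w = (0, 0, 0, 0), *img.shape
    for r in range(h):
        for c in range(w):
            for r2 in range(r + 1, h + 1):
                for c2 in range(c + 1, w + 1):
                    if img[r:r2, c:c2].any():
                        break  # ink inside; widening the columns cannot help
                    if (r2 - r) * (c2 - c) > best[2] * best[3]:
                        best = (r, c, r2 - r, c2 - c)
    return best

# Example: a 4x5 page with a single ink mark at (1, 1).
page = np.zeros((4, 5), dtype=int)
page[1, 1] = 1
rect = largest_white_rect(page)
```

Greedily covering white space with such rectangles until text blocks are isolated is the segmentation heuristic the abstract describes; the off-line enumeration and on-line unification it mentions make this tractable on full pages.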


Author(s):  
Yefeng Zheng ◽  
Huiping Li ◽  
D. Doermann

Author(s):  
J. Tan ◽  
J.-H. Lai ◽  
P. Wang ◽  
N. Bi

Techniques to identify printed and handwritten text in scanned documents differ significantly. In this paper, we address the question of how to discriminate between each type of writing on registration forms. Registration-form documents consist of various types of zones, such as printed text, handwriting, tables, images, noise, etc., so segmenting the various zones is a challenge. We adopt herein an approach called “multiscale-region projection” to identify printed text and handwriting. An important aspect of our approach is the use of multiscale techniques to segment document images. A new set of projection features extracted from each zone is also proposed. The classification rules are mined and used to discern printed text and table lines from handwritten text. The proposed system was tested on 11,118 samples in two registration-form-image databases. Some possible measures of efficiency are computed, and in each case the proposed approach performs better than traditional methods.
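A simplified version of multiscale projection features might look like the sketch below: the zone is block-downsampled at several scales and the row and column ink projections are concatenated. The scales, the max-pooling downsampling, and the function name are illustrative assumptions rather than the paper's exact definition:

```python
import numpy as np

def projection_features(zone, scales=(1, 2, 4)):
    """Concatenate horizontal and vertical ink-projection profiles of a
    binary zone, computed after block-downsampling at each scale."""
    feats = []
    for s in scales:
        h, w = (zone.shape[0] // s) * s, (zone.shape[1] // s) * s
        z = zone[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))
        feats.append(z.sum(axis=1))  # horizontal projection (per row)
        feats.append(z.sum(axis=0))  # vertical projection (per column)
    return np.concatenate(feats)

# Example zone: an 8x8 image containing one horizontal ink line on row 3.
zone = np.zeros((8, 8), dtype=int)
zone[3, :] = 1
f = projection_features(zone)
```

Coarser scales smooth over the pen-stroke irregularities that distinguish handwriting from the regular profiles of machine-printed text, which is why a multiscale profile is a plausible discriminating feature.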


This paper presents word spotting in handwritten documents based on multiple features. The features are derived using Gabor filters, Histograms of Oriented Gradients (HOG), local binary patterns, texture filters, and morphological filters. Real-world documents are heterogeneous in nature; for instance, application forms, postal cards, railway reservation forms, etc. include handwritten and printed text in different scripts. Spotting a word in such documents and retrieving it from a huge digitized repository is a challenging task. To address these issues, word spotting based on multiple features is carried out with both learning-based and learning-free methods. In both methods, texture filters exhibit outstanding performance in terms of precision, recall, and F-measure. To confirm the capability of the proposed method, extensive experiments were conducted on the publicly available GW20 dataset, with encouraging results compared to other contemporary works.
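Of the feature types named above, the local binary pattern has a compact enough definition to sketch directly; this minimal 8-neighbour variant is illustrative and not necessarily the LBP configuration used in the work:

```python
import numpy as np

def lbp_image(gray):
    """Minimal 8-neighbour local binary pattern: each interior pixel gets a
    byte whose bits record which neighbours are >= the centre pixel."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = gray[1:-1, 1:-1]
    for bit, (dr, dc) in enumerate(offsets):
        neigh = gray[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        out |= (neigh >= centre).astype(np.uint8) << bit
    return out

# A uniform patch gives the all-ones code; an isolated peak gives zero.
flat = np.full((3, 3), 7)
codes = lbp_image(flat)
peak = np.zeros((3, 3))
peak[1, 1] = 5
```

Histograms of such codes over a word image are what typically serve as the texture descriptor in spotting pipelines.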


Author(s):  
Htwe Pa Pa Win ◽  
Phyo Thu Thu Khine ◽  
Khin Nwe Ni Tun

Automatic machine-printed optical character recognizers (OCR) are highly desirable for a multitude of modern IT applications, including digital library software. However, state-of-the-art OCR systems cannot handle Myanmar scripts, as the language poses many challenges for document understanding. Therefore, the authors design an Optical Character Recognition System for Myanmar Printed Documents (OCRMPD), with several proposed techniques that can automatically recognize Myanmar printed text from document images. To obtain a more accurate system, the authors propose a method for isolating character images using not only projection methods but also structural analysis of wrongly segmented characters. To reveal the effectiveness of the segmentation technique, the authors follow a new hybrid feature extraction method and choose an SVM classifier for recognition of the character images. The proposed algorithms have been tested on a variety of Myanmar printed documents, and the experimental results indicate that the methods increase segmentation accuracy as well as recognition rates.
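The projection method this abstract mentions for character isolation can be sketched as a split at ink-free columns of the vertical projection profile; the structural re-analysis of wrongly segmented characters is not shown, and all names here are illustrative:

```python
import numpy as np

def segment_columns(binary):
    """Split a binary text-line image into character candidate spans at
    columns whose vertical ink projection is zero."""
    ink = binary.sum(axis=0) > 0
    segments, start = [], None
    for c, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = c          # a new run of inked columns begins
        elif not has_ink and start is not None:
            segments.append((start, c))  # run ended at column c
            start = None
    if start is not None:
        segments.append((start, len(ink)))
    return segments

# Example line: two "characters" occupying columns 1-2 and 5-7.
line = np.zeros((5, 10), dtype=int)
line[:, 1:3] = 1
line[:, 5:8] = 1
cuts = segment_columns(line)
```

Projection alone fails on touching or stacked glyphs, which is precisely why the paper supplements it with structural analysis.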

