Word level script identification for scanned document images

2019 ◽

Vol 9 (2) ◽

pp. 3896-3901

Keyword(s):

Efficient Method ◽

Experimental Results ◽

Small Scale ◽

Document Images ◽

Feature Description ◽

Script Identification ◽

Word Level ◽

Key Points ◽

Feature Based ◽

Handwritten Document

SIFT and LBP are two popular techniques for obtaining a “feature description” of an object. SIFT identifies key points, which are locations with distinct image information that are robust to scaling and rotation, whereas LBP transforms an image into an array of integer labels describing the small-scale appearance of the image. In this paper, we present an efficient method wherein the “feature description” of handwritten document images at word level is computed using SIFT and LBP. Identification of script type is done using KNN and SVM classifiers. Experimental results show that SVM performs better than KNN. Further, the proposed method is compared with other methods in the literature to demonstrate its efficacy.
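To make the "array of integer labels" concrete, here is a minimal sketch of the basic 8-neighbour LBP operator in plain Python. It is an illustration only, not the paper's implementation: the function names `lbp_codes` and `lbp_histogram`, the clockwise bit order, and the `>=` comparison are assumptions, and the image is assumed to be a list of lists of gray levels.

```python
from collections import Counter

# Clockwise neighbour offsets, starting at the top-left pixel (an assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(image):
    """Return the 8-bit LBP code of every interior pixel of a 2-D grayscale image."""
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(OFFSETS):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes, usable as a feature vector."""
    counts = Counter(code for row in lbp_codes(image) for code in row)
    total = sum(counts.values())
    return [counts.get(k, 0) / total for k in range(256)]
```

A histogram like this, computed per word image, is the kind of fixed-length vector that could then be passed to a KNN or SVM classifier.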

Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents

Communications in Computer and Information Science - Recent Trends in Image Processing and Pattern Recognition ◽

10.1007/978-981-10-4859-3_2 ◽

2017 ◽

pp. 16-27

Author(s):

Sk Md Obaidullah ◽

K. C. Santosh ◽

Chayan Halder ◽

Nibaran Das ◽

Kaushik Roy

Keyword(s):

Script Identification ◽

Word Level

Word-Level Script Identification Using Texture Based Features

International Journal of System Dynamics Applications ◽

10.4018/ijsda.2015040105 ◽

2015 ◽

Vol 4 (2) ◽

pp. 74-94

Author(s):

Pawan Kumar Singh ◽

Ram Sarkar ◽

Mita Nasipuri

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Statistical Significance ◽

Document Image ◽

Statistical Significance Testing ◽

Script Identification ◽

Word Level ◽

Histograms Of Oriented Gradients ◽

Handwritten Text ◽

Identification Technique

Script identification has been an appealing research topic in the field of document image analysis over the last few decades. Accurate recognition of the script is paramount to many post-processing steps such as automated document sorting, machine translation, and searching for text written in a particular script in a multilingual environment. For automatic processing of such documents through Optical Character Recognition (OCR) software, it is necessary to identify the script of each word before feeding it to the OCR engine for that script. In this paper, a robust word-level handwritten script identification technique is proposed that uses texture based features to identify words written in any of seven popular scripts, namely Bangla, Devanagari, Gurumukhi, Malayalam, Oriya, Telugu, and Roman. The texture based features comprise a combination of Histograms of Oriented Gradients (HOG) and Moment invariants. The technique has been tested on 7000 handwritten text words, with each script contributing 1000 words. Based on the identification accuracies and statistical significance testing of seven well-known classifiers, Multi-Layer Perceptron (MLP) has been chosen as the final classifier, which is then tested comprehensively using different folds and different epoch sizes. The overall accuracy of the system is found to be 94.7% using a 5-fold cross validation scheme, which is quite impressive considering the complexities and shape variations of the said scripts. This is an extended version of the paper described in (Singh et al., 2014).
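The core idea behind HOG can be sketched as a magnitude-weighted histogram of gradient orientations. The block below is a simplified, single-cell version in plain Python, offered only as an illustration: it omits the cell grid and block normalisation of full HOG, and the function name `orientation_histogram`, the 9 unsigned bins, and the central-difference gradients are assumptions rather than details taken from the paper.

```python
import math


def orientation_histogram(image, bins=9):
    """L2-normalised histogram of unsigned gradient orientations (0-180 degrees)."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical gradients.
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

For a word image containing mostly vertical strokes, nearly all of the weight falls into the first bin (gradients pointing horizontally), which is the kind of script-dependent stroke statistic such features try to capture.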

Script Identification of Camera Based Bilingual Document Images Using SFTA Features

10.4018/978-1-6684-3690-5.ch040 ◽

2022 ◽

pp. 811-822

Author(s):

B.V. Dhandra ◽

Satishkumar Mallappa ◽

Gururaj Mukarambi

Keyword(s):

Block Size ◽

Texture Features ◽

Input Image ◽

Document Image ◽

Optimal Size ◽

Document Images ◽

Binary Images ◽

Script Identification ◽

Unified Algorithm ◽

Block Sizes

In this article, an exhaustive experiment is carried out to test the performance of Segmentation-based Fractal Texture Analysis (SFTA) features with nt = 4 pairs and nt = 8 pairs, geometric features, and their combinations. A unified algorithm is designed to identify the scripts of camera-captured bilingual document images containing English, an international language, together with one of the Hindi, Kannada, Telugu, Malayalam, Bengali, Oriya, Punjabi, and Urdu scripts. The SFTA algorithm decomposes the input image into a set of binary images, from which the fractal dimensions of the resulting regions are computed in order to describe the segmented texture patterns. This motivates the use of SFTA features as texture features to identify the scripts of camera-based document images, which are affected by non-homogeneous illumination (Resolution). An experiment is carried out on eleven scripts, each with 1000 sample images, at block sizes of 128 × 128, 256 × 256, 512 × 512 and 1024 × 1024. It is observed that the block size 512 × 512 gives the maximum accuracy of 86.45% for the Gujarathi and English script combination and is the optimal size. The novelty of this article is that a unified algorithm is developed for the script identification of bilingual document images.
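The fractal dimension of a binary image is commonly estimated by box counting: count the boxes of side s that contain foreground, then fit the slope of log N(s) against log(1/s). The sketch below shows that estimate in plain Python. It is an assumption-laden illustration, not the article's SFTA pipeline: the thresholding step that produces the binary images is omitted, and the names `box_count` and `box_counting_dimension` and the default box sizes are hypothetical.

```python
import math


def box_count(binary, size):
    """Number of size x size boxes that contain at least one foreground pixel."""
    rows, cols = len(binary), len(binary[0])
    count = 0
    for r0 in range(0, rows, size):
        for c0 in range(0, cols, size):
            if any(
                binary[r][c]
                for r in range(r0, min(r0 + size, rows))
                for c in range(c0, min(c0 + size, cols))
            ):
                count += 1
    return count


def box_counting_dimension(binary, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) versus log(1/s) over the given box sizes."""
    xs, ys = [], []
    for s in sizes:
        n = box_count(binary, s)
        if n > 0:
            xs.append(math.log(1.0 / s))
            ys.append(math.log(n))
    if len(xs) < 2:
        raise ValueError("need foreground pixels at two or more box sizes")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

As a sanity check, a filled square yields a dimension of about 2 and a straight line about 1; textured regions from real document images would fall in between.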

Automatic Indic script identification from handwritten documents: page, block, line and word-level approach

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-017-0702-8 ◽

2017 ◽

Vol 10 (1) ◽

pp. 87-106 ◽

Cited By ~ 14

Author(s):

Sk Md Obaidullah ◽

K. C. Santosh ◽

Chayan Halder ◽

Nibaran Das ◽

Kaushik Roy

Keyword(s):

Handwritten Documents ◽

Script Identification ◽

Word Level ◽

Indic Script

Word-level script identification for handwritten Indic scripts

2015 13th International Conference on Document Analysis and Recognition (ICDAR) ◽

10.1109/icdar.2015.7333932 ◽

2015 ◽

Cited By ~ 9

Author(s):

Pawan Kumar Singh ◽

Ram Sarkar ◽

Mita Nasipuri ◽

David Doermann

Keyword(s):

Script Identification ◽

Word Level ◽

Indic Scripts

Deep Learning for Word-Level Handwritten Indic Script Identification

Communications in Computer and Information Science - Recent Trends in Image Processing and Pattern Recognition ◽

10.1007/978-981-16-0507-9_42 ◽

2021 ◽

pp. 499-510

Author(s):

Soumya Ukil ◽

Swarnendu Ghosh ◽

Sk Md Obaidullah ◽

K. C. Santosh ◽

Kaushik Roy ◽

...

Keyword(s):

Deep Learning ◽

Script Identification ◽

Word Level ◽

Indic Script

Line Parameter based Word-Level Indic Script Identification System

International Journal of Computer Vision and Image Processing ◽

10.4018/ijcvip.2016070102 ◽

2016 ◽

Vol 6 (2) ◽

pp. 18-41 ◽

Cited By ~ 3

Author(s):

Pawan Kumar Singh ◽

Supratim Das ◽

Ram Sarkar ◽

Mita Nasipuri

Keyword(s):

Hough Transform ◽

Identification Accuracy ◽

Distance Transform ◽

Multi Layer Perceptron ◽

Script Identification ◽

Spatial Features ◽

Word Level ◽

Line Parameter ◽

Word Images ◽

Mlp Classifier

Since Optical Character Recognition (OCR) engines are usually script-dependent, automatic text recognition in a multi-script environment requires a pre-processing module that helps identify the scripts before the text is processed by the respective OCR engine. The work becomes more challenging when it deals with handwritten documents, which is still a less explored research area. In this paper, a line parameter based approach is presented to identify handwritten words written in eight popular scripts, namely Bangla, Devanagari, Gujarati, Gurumukhi, Manipuri, Oriya, Urdu, and Roman. A combination of Hough transform (HT) and Distance transform (DT) is used to extract the directional spatial features based on the line parameter. Experiments are performed at word level using multiple classifiers on a dataset of 12000 handwritten word images, and the Multi Layer Perceptron (MLP) classifier is found to be the best performing classifier, with an identification accuracy of 95.28%. The performance of the present technique is also compared with those of other state-of-the-art script identification methods on the same database.
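The Hough transform step can be illustrated with a small voting accumulator over the line parameters (theta, rho), where rho = x cos(theta) + y sin(theta). The sketch below is a generic plain-Python version under stated assumptions: the function name `hough_accumulator`, the one-degree angular resolution, and the rounding of rho to whole pixels are illustrative choices, and the distance transform part of the paper's features is not shown.

```python
import math


def hough_accumulator(points, n_theta=180):
    """Vote in (theta index, rho) space for every foreground point (x, y).

    theta index t corresponds to the angle 180 * t / n_theta degrees.
    """
    accumulator = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            # Signed distance from the origin to the line through (x, y) at this angle.
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            key = (t, rho)
            accumulator[key] = accumulator.get(key, 0) + 1
    return accumulator
```

Cells with many votes correspond to dominant straight strokes. For example, the ten points of a horizontal stroke at y = 5 all vote for the cell (90, 5), so the distribution of peak angles gives a rough directional signature of a word image.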

Word level multi-script identification

Pattern Recognition Letters ◽

10.1016/j.patrec.2008.01.027 ◽

2008 ◽

Vol 29 (9) ◽

pp. 1218-1229 ◽

Cited By ~ 72

Author(s):

Peeta Basa Pati ◽

A.G. Ramakrishnan

Keyword(s):

Script Identification ◽

Word Level

Feature Selection Using Harmony Search for Script Identification from Handwritten Document Images

Journal of Intelligent Systems ◽

10.1515/jisys-2016-0070 ◽

2018 ◽

Vol 27 (3) ◽

pp. 465-488 ◽

Cited By ~ 5

Author(s):

Pawan Kumar Singh ◽

Supratim Das ◽

Ram Sarkar ◽

Mita Nasipuri

Keyword(s):

Feature Selection ◽

Harmony Search ◽

Distribution Networks ◽

Feature Subset Selection ◽

Support Vector ◽

Feature Subset ◽

Document Images ◽

Global Features ◽

Script Identification ◽

Handwritten Document

The feature selection process can be considered a problem of global combinatorial optimization in machine learning, which removes irrelevant, noisy, and non-contributing features while retaining acceptable classification accuracy. Harmony search algorithm (HSA) is an evolutionary algorithm that is applied to various optimization problems such as scheduling, text summarization, water distribution networks, vehicle routing, etc. This paper presents a hybrid approach based on support vector machine and HSA for wrapper feature subset selection. This approach is used to select an optimized set of features from an initial set of features obtained by applying Modified log-Gabor filters on prepartitioned rectangular blocks of handwritten document images written in any of 12 official Indic scripts. The assessment justifies the need for feature selection in handwritten script identification, where local and global features are computed without knowing the exact importance of each feature. The proposed approach is also compared with four well-known evolutionary algorithms, namely genetic algorithm, particle swarm optimization, tabu search, and ant colony optimization, and two statistical feature dimensionality reduction techniques, namely greedy attribute search and principal component analysis. The acquired results show that the optimal set of features selected using HSA gives better accuracy in handwritten script recognition.
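A wrapper feature selection loop driven by harmony search can be sketched as follows, with each harmony being a 0/1 mask over the features. This is a generic binary harmony search under stated assumptions, not the paper's hybrid: the function name `harmony_search`, the parameter defaults (`hmcr`, `par`, memory size, iteration count), and the bit-flip pitch adjustment are illustrative, and the `fitness` callable stands in for whatever classifier-based score (for instance, SVM validation accuracy) a real wrapper would compute.

```python
import random


def harmony_search(fitness, n_features, memory_size=10, iterations=200,
                   hmcr=0.9, par=0.3, seed=0):
    """Maximise fitness(mask) over 0/1 feature masks with a basic harmony search."""
    rng = random.Random(seed)
    # Harmony memory: a pool of candidate feature masks and their scores.
    memory = [[rng.randint(0, 1) for _ in range(n_features)]
              for _ in range(memory_size)]
    scores = [fitness(mask) for mask in memory]
    for _ in range(iterations):
        new_mask = []
        for j in range(n_features):
            if rng.random() < hmcr:
                bit = rng.choice(memory)[j]   # memory consideration
                if rng.random() < par:
                    bit = 1 - bit             # pitch adjustment as a bit flip
            else:
                bit = rng.randint(0, 1)       # random selection
            new_mask.append(bit)
        new_score = fitness(new_mask)
        # Replace the worst harmony when the new one scores better.
        worst = min(range(memory_size), key=scores.__getitem__)
        if new_score > scores[worst]:
            memory[worst], scores[worst] = new_mask, new_score
    best = max(range(memory_size), key=scores.__getitem__)
    return memory[best], scores[best]
```

In a real wrapper setting, `fitness` would train and validate a classifier on the columns selected by the mask, often with a small penalty on the number of selected features so that compact subsets are preferred.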
