Word-level script identification for handwritten Indic scripts

Script identification has been an active research topic in document image analysis over the last few decades. Accurate recognition of the script is essential to many subsequent processing steps such as automated document sorting, machine translation, and searching for text written in a particular script in a multilingual environment. To process such documents automatically with Optical Character Recognition (OCR) software, the words of each script must be identified before they are fed to the OCR engine of the corresponding script. In this paper, a robust word-level handwritten script identification technique is proposed that uses texture-based features to identify words written in any of seven popular scripts, namely Bangla, Devanagari, Gurumukhi, Malayalam, Oriya, Telugu, and Roman. The texture-based features comprise a combination of Histograms of Oriented Gradients (HOG) and moment invariants. The technique has been tested on 7000 handwritten text words, with each script contributing 1000 words. Based on the identification accuracies and statistical significance testing of seven well-known classifiers, the Multi-Layer Perceptron (MLP) was chosen as the final classifier, which was then tested comprehensively using different numbers of folds and different epoch sizes. The overall accuracy of the system is 94.7% using a 5-fold cross-validation scheme, which is notable considering the complexity and shape variations of these scripts. This is an extended version of the paper described in (Singh et al., 2014).
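The abstract above names Histograms of Oriented Gradients as one of its texture features. As a rough illustration of the idea only, the sketch below builds a single magnitude-weighted orientation histogram over a small grayscale image. It is not the authors' implementation: the function name `orientation_histogram`, the 9-bin layout, the central-difference gradient, and the use of one histogram for the whole image (rather than cells and blocks) are all assumptions made for this example.

```python
import math


def orientation_histogram(image, bins=9):
    """Return a normalized, magnitude-weighted histogram of unsigned
    gradient orientations (0 to 180 degrees) over the image interior.

    `image` is a list of rows of numeric pixel intensities.
    This is a simplified, single-cell HOG-style descriptor (assumption).
    """
    rows, cols = len(image), len(image[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins

    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical gradient.
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / bin_width), bins - 1)
            hist[index] += magnitude

    total = sum(hist)
    return [value / total for value in hist] if total else hist


# A vertical edge: intensity changes only along x, so all votes land in bin 0.
vertical_edge = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
# A horizontal edge: intensity changes only along y, so all votes land near 90 degrees.
horizontal_edge = [[0] * 5 for _ in range(3)] + [[1] * 5 for _ in range(3)]

vertical_hist = orientation_histogram(vertical_edge)
horizontal_hist = orientation_histogram(horizontal_edge)
```

In a word-level pipeline, a vector like this would typically be concatenated with other features (here, the moment invariants the abstract mentions) before being passed to a classifier.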

Automatic Indic script identification from handwritten documents: page, block, line and word-level approach

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-017-0702-8 ◽

2017 ◽

Vol 10 (1) ◽

pp. 87-106 ◽

Cited By ~ 14

Author(s):

Sk Md Obaidullah ◽

K. C. Santosh ◽

Chayan Halder ◽

Nibaran Das ◽

Kaushik Roy

Keyword(s):

Handwritten Documents ◽

Script Identification ◽

Word Level ◽

Indic Script

A new dataset of word-level offline handwritten numeral images from four official Indic scripts and its benchmarking using image transform fusion

International Journal of Intelligent Engineering Informatics ◽

10.1504/ijiei.2016.074497 ◽

2016 ◽

Vol 4 (1) ◽

pp. 1 ◽

Cited By ~ 1

Author(s):

Sk Md Obaidullah ◽

Chayan Halder ◽

Nibaran Das ◽

Kaushik Roy

Keyword(s):

Word Level ◽

Image Transform ◽

Indic Scripts

Word level script identification for scanned document images

10.1117/12.530538 ◽

2003 ◽

Cited By ~ 12

Author(s):

Huanfeng Ma ◽

David Doermann

Keyword(s):

Document Images ◽

Script Identification ◽

Word Level

Deep Learning for Word-Level Handwritten Indic Script Identification

Communications in Computer and Information Science - Recent Trends in Image Processing and Pattern Recognition ◽

10.1007/978-981-16-0507-9_42 ◽

2021 ◽

pp. 499-510

Author(s):

Soumya Ukil ◽

Swarnendu Ghosh ◽

Sk Md Obaidullah ◽

K. C. Santosh ◽

Kaushik Roy ◽

...

Keyword(s):

Deep Learning ◽

Script Identification ◽

Word Level ◽

Indic Script

Line Parameter based Word-Level Indic Script Identification System

International Journal of Computer Vision and Image Processing ◽

10.4018/ijcvip.2016070102 ◽

2016 ◽

Vol 6 (2) ◽

pp. 18-41 ◽

Cited By ~ 3

Author(s):

Pawan Kumar Singh ◽

Supratim Das ◽

Ram Sarkar ◽

Mita Nasipuri

Keyword(s):

Hough Transform ◽

Identification Accuracy ◽

Distance Transform ◽

Multi Layer Perceptron ◽

Script Identification ◽

Spatial Features ◽

Word Level ◽

Line Parameter ◽

Word Images ◽

MLP Classifier

Since Optical Character Recognition (OCR) engines are usually script-dependent, automatic text recognition in a multi-script environment requires a pre-processing module that identifies the script before the text is passed to the respective OCR engine. The work becomes more challenging when it deals with handwritten documents, which remain a less explored research area. In this paper, a line parameter based approach is presented to identify handwritten text written in eight popular scripts, namely Bangla, Devanagari, Gujarati, Gurumukhi, Manipuri, Oriya, Urdu, and Roman. A combination of Hough transform (HT) and Distance transform (DT) is used to extract directional spatial features based on the line parameter. Experiments are performed at word level using multiple classifiers on a dataset of 12000 handwritten word images, and the Multi Layer Perceptron (MLP) classifier is found to perform best, with an identification accuracy of 95.28%. The performance of the present technique is also compared with that of other state-of-the-art script identification methods on the same database.
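To make the Hough transform step in the abstract above more concrete, here is a minimal voting sketch in the normal (rho, theta) line parameterization, where rho = x cos(theta) + y sin(theta). It only illustrates how foreground pixels accumulate votes for candidate lines; the function name `hough_accumulator`, the one-degree angular resolution, and the integer rounding of rho are assumptions for this example, and the paper's actual feature extraction (including its use of the distance transform) is not reproduced here.

```python
import math


def hough_accumulator(points, theta_steps=180):
    """Vote in (rho, theta-index) space for each foreground pixel.

    `points` is an iterable of (x, y) pixel coordinates.
    Returns a dict mapping (rho, theta_index) to a vote count, where
    theta = pi * theta_index / theta_steps and rho is rounded to an integer.
    """
    accumulator = {}
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            # Normal form of a line: rho = x*cos(theta) + y*sin(theta).
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            key = (rho, t)
            accumulator[key] = accumulator.get(key, 0) + 1
    return accumulator


# Ten pixels on the horizontal line y = 5, similar to a headline stroke
# found in scripts such as Bangla or Devanagari.
stroke = [(x, 5) for x in range(10)]
votes = hough_accumulator(stroke)
peak_votes = max(votes.values())
```

With coarse rounding, neighbouring angles can tie with the true line, so a practical system would usually smooth the accumulator or pick local maxima rather than rely on a single peak cell.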

A Corpus of Word-Level Offline Handwritten Numeral Images from Official Indic Scripts

Advances in Intelligent Systems and Computing - Proceedings of the Second International Conference on Computer and Communication Technologies ◽

10.1007/978-81-322-2517-1_67 ◽

2015 ◽

pp. 703-711 ◽

Cited By ~ 1

Author(s):

Sk Md Obaidullah ◽

Chayan Halder ◽

Nibaran Das ◽

Kaushik Roy

Keyword(s):

Word Level ◽

Indic Scripts

Word level multi-script identification

Pattern Recognition Letters ◽

10.1016/j.patrec.2008.01.027 ◽

2008 ◽

Vol 29 (9) ◽

pp. 1218-1229 ◽

Cited By ~ 72

Author(s):

Peeta Basa Pati ◽

A.G. Ramakrishnan

Keyword(s):

Script Identification ◽

Word Level

Smart Device Authentication Based on Online Handwritten Script Identification and Word Recognition in Indic Scripts Using Zone-Wise Features

International Journal of Information System Modeling and Design ◽

10.4018/ijismd.2018010102 ◽

2018 ◽

Vol 9 (1) ◽

pp. 21-55 ◽

Cited By ~ 3

Author(s):

Rajib Ghosh ◽

Partha Pratim Roy ◽

Prabhat Kumar

Keyword(s):

Factor Analysis ◽

Smart Devices ◽

Smart Device ◽

Basic Form ◽

Native Languages ◽

Script Identification ◽

Vital Component ◽

Authentication Schemes ◽

Secure Authentication ◽

Indic Scripts

Secure authentication is a vital component of device security. The most basic form of authentication uses passwords. With the evolution of smart devices, selecting strong passwords that resist cracking has become a challenging task. Such passwords, if written in native languages, tend to offer improved security, since attackers with no knowledge of these scripts find them hard to crack. This article proposes two zone-wise feature extraction approaches, zone-wise structural and directional (ZSD) and zone-wise slopes of dominant points (ZSDP), to recognize online handwritten script and words in four major Indic scripts: Devanagari, Bengali, Telugu and Tamil. These features have been used separately and in combination in an HMM-based platform for recognition. Dimension reduction of the ZSD-ZSDP combination with factor analysis has shown the best performance in all four scripts. This work can be used to set up authentication schemes with passwords in Indic scripts, making them difficult to crack for hackers with no knowledge of such scripts.