script identification
Recently Published Documents


TOTAL DOCUMENTS

221
(FIVE YEARS 52)

H-INDEX

21
(FIVE YEARS 4)

2022 ◽  
pp. 811-822
Author(s):  
B.V. Dhandra ◽  
Satishkumar Mallappa ◽  
Gururaj Mukarambi

This article reports an exhaustive experiment that tests the performance of Segmentation-based Fractal Texture Analysis (SFTA) features with nt = 4 pairs and nt = 8 pairs, geometric features, and their combinations. A unified algorithm is designed to identify the scripts of camera-captured bilingual document images containing the international language English together with one of the Hindi, Kannada, Telugu, Malayalam, Bengali, Oriya, Punjabi, and Urdu scripts. The SFTA algorithm decomposes the input image into a set of binary images, from which the fractal dimensions of the resulting regions are computed to describe the segmented texture patterns. This motivates the use of SFTA features as texture features for identifying the scripts of camera-based document images, which are affected by non-homogeneous illumination (resolution). The experiment is carried out on eleven scripts, each with 1000 sample images, at block sizes of 128 × 128, 256 × 256, 512 × 512, and 1024 × 1024. The block size 512 × 512 is found to be optimal, giving the maximum accuracy of 86.45% for the Gujarati and English script combination. The novelty of this article is that a unified algorithm is developed for script identification in bilingual document images.
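
The decomposition-then-fractal-dimension idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: SFTA as usually described picks its thresholds with multi-level Otsu and measures the borders of the regions, whereas this sketch assumes caller-supplied thresholds, a square image given as a list of rows, and box counting applied directly to the regions. All function names are hypothetical.

```python
import math

def two_threshold_binarize(gray, lower, upper):
    """Binary image: 1 where lower < pixel <= upper, else 0."""
    return [[1 if lower < p <= upper else 0 for p in row] for row in gray]

def box_count(binary, box):
    """Count box x box cells holding at least one foreground pixel.
    Assumes a square image (len(binary) rows of the same length)."""
    size = len(binary)
    count = 0
    for top in range(0, size, box):
        for left in range(0, size, box):
            if any(binary[y][x]
                   for y in range(top, min(top + box, size))
                   for x in range(left, min(left + box, size))):
                count += 1
    return count

def box_counting_dimension(binary):
    """Least-squares slope of log(count) against log(1 / box size),
    using power-of-two box sizes up to half the image side."""
    size = len(binary)
    xs, ys = [], []
    box = 1
    while box <= size // 2:
        n = box_count(binary, box)
        if n > 0:
            xs.append(math.log(1.0 / box))
            ys.append(math.log(n))
        box *= 2
    if len(xs) < 2:
        return 0.0  # empty region or image too small to fit a slope
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def sfta_like_features(gray, thresholds):
    """For each pair of adjacent thresholds (255 is appended as the top
    bound), emit the region's box-counting dimension and pixel area."""
    feats = []
    bounds = list(thresholds) + [255]
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        binary = two_threshold_binarize(gray, lower, upper)
        area = sum(map(sum, binary))
        feats.extend([box_counting_dimension(binary), area])
    return feats
```

A filled square yields a dimension close to 2 and a single straight row close to 1, which is the behaviour a texture descriptor of this kind relies on.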


Author(s):  
Houda Gaddour ◽  
Slim Kanoun ◽  
Nicole Vincent

Text in scene images can provide useful and vital information for content-based image analysis. Therefore, text detection and script identification in images are important tasks. In this paper, we propose a new method for text detection in natural scene images, particularly for Arabic text, based on a bottom-up approach with four principal steps. First, extremely stable and homogeneous regions of interest (ROIs) are detected using the proposed Color Stability and Homogeneity Regions (CSHR) technique. These regions are then labeled as textual or non-textual ROIs using a structural approach. Next, the textual ROIs are grouped into zones according to the spatial relations between them. Finally, the textual or non-textual nature of the constituted zones is refined, based on handcrafted features and on features learned by a Convolutional Neural Network (CNN). The proposed method was evaluated on databases used for text detection in natural scene images: the competitions organized in the 2017 edition of the International Conference on Document Analysis and Recognition (ICDAR2017), the Urdu-text database, and our Natural Scene Image Database for Arabic Text detection (NSIDAT). The obtained experimental results seem to be interesting.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Muhammad Yasir ◽  
Li Chen ◽  
Amna Khatoon ◽  
Muhammad Amir Malik ◽  
Fazeel Abid

Mixed script identification is a hindrance for automated natural language processing systems. Mixing cursive scripts of different languages is a challenge because NLP methods such as POS tagging and word sense disambiguation suffer from noisy text. This study tackles the challenge of mixed script identification for a code-mixed dataset consisting of Roman Urdu, Hindi, Saraiki, Bengali, and English. The language identification model is trained using word vectorization and RNN variants. Through experimental investigation, different architectures based on Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU) are optimized for the task. Experimentation achieved the highest accuracy of 90.17 for Bi-GRU, applying learned word class features along with GloVe embeddings. This study also addresses issues related to multilingual environments, such as Roman words merged with English characters, generative spellings, and phonetic typing.
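
To make the Bi-GRU idea concrete, here is a minimal forward-pass sketch in plain Python. It is an illustration only, not the model from this study: the weights are random rather than trained, bias terms are omitted, the input vectors stand in for word embeddings, and every name (`init_gru`, `gru_step`, `bi_gru_encode`) is hypothetical. The state update uses one common GRU convention; libraries differ on which side of the blend the update gate weights.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def init_gru(input_dim, hidden_dim, rng):
    """Random input and recurrent weight matrices for the update (z),
    reset (r), and candidate (h) parts of a GRU. Biases are omitted."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]
    return {gate: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim))
            for gate in ("z", "r", "h")}

def gru_step(params, x, h):
    """One GRU time step: gate the old state, then blend in the candidate."""
    wz, uz = params["z"]
    wr, ur = params["r"]
    wh, uh = params["h"]
    z = [sigmoid(a + b) for a, b in zip(matvec(wz, x), matvec(uz, h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(wr, x), matvec(ur, h))]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(wh, x), matvec(uh, gated))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def run_gru(params, sequence, hidden_dim):
    """Run a GRU over the sequence from a zero state; return the last state."""
    h = [0.0] * hidden_dim
    for x in sequence:
        h = gru_step(params, x, h)
    return h

def bi_gru_encode(fwd, bwd, sequence, hidden_dim):
    """Concatenate the final states of a forward and a backward pass."""
    return (run_gru(fwd, sequence, hidden_dim)
            + run_gru(bwd, sequence[::-1], hidden_dim))
```

In a full classifier the concatenated vector would feed a trained output layer that scores each language; that layer and the training loop are left out here.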


Author(s):  
Shubhankar Sharma ◽  
Vatsala Arora

Character recognition is an active research area that poses many challenges, and various pattern recognition techniques are in everyday use. Because so many writing styles exist, developing OCR (Optical Character Recognition) for handwritten text is difficult. Several measures therefore have to be taken to improve the recognition process so that the computational burden decreases and the accuracy of pattern recognition increases. The main objective of this review was to recognize and analyze handwritten document images. In this paper, we present a scheme to identify different Indian scripts such as Devanagari and Gurumukhi.


2021 ◽  
Author(s):  
Sukhandeep Kaur ◽  
Seema Bawa ◽  
Ravinder Kumar

Script identification at character level in handwritten documents is a challenging task for Gurumukhi and Latin scripts due to the presence of slightly similar, quite similar, or at times confusing character pairs. Hence, a single feature set, or traditional feature sets with a single classifier, is found to be inadequate for processing handwritten documents. With the evolution of deep learning, the importance of traditional feature extraction approaches has been somewhat neglected; this paper takes them into account. It investigates machine learning and deep learning ensemble approaches at the feature extraction and classification levels for script identification. The approach is: (i) combining traditional and deep learning based features; (ii) evaluating various ensemble approaches using individual and combined feature sets to perform script identification; (iii) evaluating pre-trained deep networks using transfer learning for script identification; (iv) finding the best combination of feature set and classifiers for script identification. Three kinds of traditional features are employed: Gabor filter, Gray Level Co-Occurrence Matrix (GLCM), and Histograms of Oriented Gradients (HOG). For deep learning, pre-trained deep networks such as VGG19, ResNet50, and LeNet5 are used as feature extractors. These individual and combined features are used to train classifiers such as Support Vector Machines (SVM), K Nearest Neighbor (KNN), and Random Forest (RF). Further, ensemble approaches such as voting, boosting, and bagging are evaluated for script classification. Exhaustive experimental work resulted in the highest accuracy of 98.82%, obtained with features extracted from ResNet50 using transfer learning and a bagging-based ensemble classifier, which is higher than previously reported work.
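
The combination of feature sets and the bagging ensemble mentioned in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: the base learner is a 1-nearest-neighbour rule rather than the classifiers they evaluated, the feature vectors are made up, and all function names are hypothetical.

```python
import random
from collections import Counter

def combine_features(traditional, deep):
    """Concatenate two feature vectors into one (simple early fusion)."""
    return list(traditional) + list(deep)

def nearest_neighbor_label(train, query):
    """1-NN base learner: return the label of the closest training vector.
    `train` is a list of (feature_vector, label) pairs."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], query))[1]

def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def bagging_predict(train, query, n_bags=25, seed=0):
    """Bagging: fit one base learner per bootstrap resample of the
    training set, then combine their predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_bags):
        bag = [rng.choice(train) for _ in train]  # sample with replacement
        votes.append(nearest_neighbor_label(bag, query))
    return majority_vote(votes)

# Toy data: two made-up "traditional" and two made-up "deep" features
# per character image, fused into one four-dimensional vector.
train = (
    [(combine_features([0.1 * i, 0.0], [0.2, 0.1]), "Gurumukhi")
     for i in range(5)]
    + [(combine_features([5.0 + 0.1 * i, 5.0], [5.2, 5.1]), "Latin")
       for i in range(5)]
)
prediction = bagging_predict(train, combine_features([5.1, 4.9], [5.0, 5.0]))
```

Each bootstrap resample sees a slightly different training set, so the vote averages out the quirks of any single learner; this is the general motivation for bagging, independent of the specific features used.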


Author(s):  
Mridul Ghosh ◽  
Himadri Mukherjee ◽  
Sk Md Obaidullah ◽  
K. C. Santosh ◽  
Nibaran Das ◽  
...  

2021 ◽  
Vol 91 ◽  
pp. 107043
Author(s):  
Ashwaq Khalil ◽  
Moath Jarrah ◽  
Mahmoud Al-Ayyoub ◽  
Yaser Jararweh

Author(s):  
Rajneesh Rani ◽  
Renu Dhir ◽  
Deepti Kakkar ◽  
Nonita Sharma

Identifying the script in a document page image is the first step for an OCR system that processes multi-script documents. In a multilingual, multiscript world, document processing systems that need human involvement to select the appropriate OCR package are undesirable and inefficient. The development of robust and efficient methods for automatic script identification is therefore of major importance for automatic document processing in a multilingual, multiscript environment. The basic objective is to arrive at intuitive methods that are straightforward to implement without compromising efficiency. The aim of this work is to evaluate state-of-the-art feature extraction and classification techniques for automatic script identification of printed and handwritten documents and to propose the best combination of the two.

