Categorization of Multilingual Text on Languages of Indic Script

Abstract: In any multi-script environment, handwritten script classification is an unavoidable prerequisite before document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, researchers have approached this complex pattern classification problem with a variety of feature vectors, most of them high-dimensional, which increases the computational complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step that reduces the size of the feature vectors by restricting them to the essential and relevant features. In the present work, we address this issue by introducing a new FS algorithm called Hybrid Swarm and Gravitation-based FS (HSGFS). The algorithm is applied to three feature vectors recently introduced in the literature: Distance-Hough Transform (DHT), Histogram of Oriented Gradients (HOG), and Modified log-Gabor (MLG) filter Transform. Three state-of-the-art classifiers, namely Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal feature subset generated by the proposed FS model. Handwritten datasets at block, text-line, and word level, covering the 12 officially recognized Indic scripts, are prepared for experimentation. On all three datasets, classification accuracy improves by 2–5% on average while using only about 75–80% of the original feature vectors. The proposed method also performs better than some popularly used FS models. The code used for implementing HSGFS can be found at the following GitHub link: https://github.com/Ritam-Guha/HSGFS.
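The abstract describes scoring candidate feature subsets with a classifier while keeping only a fraction of the original features. The following is a minimal sketch of that general wrapper-style scoring idea, not the HSGFS algorithm itself. The function names, the leave-one-out 1-NN evaluation, the weight `alpha`, and the toy data are all assumptions made for illustration; the abstract does not specify the fitness function the authors use.

```python
import math


def knn_loo_accuracy(X, y, mask, k=1):
    """Leave-one-out accuracy of a k-NN classifier that only sees
    the features j where mask[j] is 1 (hypothetical helper)."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        # Euclidean distances from sample i to every other sample
        dists = []
        for t, xt in enumerate(X):
            if t == i:
                continue
            d = math.sqrt(sum((xi[j] - xt[j]) ** 2 for j in cols))
            dists.append((d, y[t]))
        dists.sort(key=lambda pair: pair[0])
        # Majority vote among the k nearest neighbours
        votes = {}
        for _, label in dists[:k]:
            votes[label] = votes.get(label, 0) + 1
        if max(votes, key=votes.get) == y[i]:
            correct += 1
    return correct / len(X)


def fitness(X, y, mask, alpha=0.99):
    """Assumed wrapper fitness: reward accuracy, mildly reward
    dropping features. alpha is an illustrative weight."""
    accuracy = knn_loo_accuracy(X, y, mask)
    kept_fraction = sum(mask) / len(mask)
    return alpha * accuracy + (1 - alpha) * (1 - kept_fraction)


# Toy data (invented): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -4.0], [0.2, 3.0],
     [1.0, -5.0], [1.1, 4.0], [1.2, -3.0]]
y = ["A", "A", "A", "B", "B", "B"]

for mask in ([1, 1], [1, 0], [0, 1]):
    print(mask, round(fitness(X, y, mask), 3))
```

In this toy setting the mask that drops the noisy feature scores higher than the full feature set. A search procedure such as a swarm- or gravitation-based optimizer would explore many such masks and keep the best-scoring one.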

Multilingual Text Analysis: History, Tasks, and Challenges

Multilingual Text Analysis, 2019, pp. 1-29
DOI: 10.1142/9789813274884_0001

Author(s): Natalia Vanetik, Marina Litvak

Keyword(s): Text Analysis, Multilingual Text

Domain-specific Evaluation Dataset Generator for Multilingual Text Analysis

Journal of Intelligent Systems with Applications, 2019, pp. 140-147
DOI: 10.54856/jiswa.201912084

Author(s): Emrah Inan, Vahab Mostafapour, Fatif Tekbacak

Keyword(s): Text Analysis, General Purpose, Entity Linking, Named Entity, Domain Specific, Benchmark Datasets, Concise Information, Multilingual Text, The Given, Specific Evaluation

The Web makes it possible to retrieve concise information about specific entities, including people, organizations, and movies, along with their features. However, a large share of Web resources is unstructured, which makes it difficult to find critical information about specific entities. Text analysis approaches such as Named Entity Recognition and Entity Linking aim to identify entities and link them to the relevant entities in a given knowledge base. A vast number of general-purpose benchmark datasets exist for evaluating these approaches. Domain-specific approaches, however, are difficult to evaluate because evaluation datasets for specific domains are lacking. This study presents WeDGeM, a multilingual evaluation set generator for specific domains that exploits Wikipedia category pages and the DBpedia hierarchy. Wikipedia disambiguation pages are also used to adjust the ambiguity level of the generated texts. Based on the generated test data, well-known Entity Linking systems that support Turkish texts are evaluated in a use case for the movie domain.

Recognition of Handwritten Indic Script Numerals Using Mojette Transform

Advances in Intelligent Systems and Computing - Proceedings of the First International Conference on Intelligent Computing and Communication, 2016, pp. 459-466
DOI: 10.1007/978-981-10-2035-3_47
Cited By ~ 2

Author(s): Pawan Kumar Singh, Supratim Das, Ram Sarkar, Mita Nasipuri

Keyword(s): Indic Script, Mojette Transform

Multilingual Web Content Mining

Intelligent Agents for Data Mining and Information Retrieval, 2004, pp. 88-100
DOI: 10.4018/978-1-59140-194-0.ch006

Author(s): Rowena Chau, Chung-Hsing Yeh

Keyword(s): Information Filtering, User Profile, Linguistic Knowledge, Web Content, Self Organizing Maps, Web Documents, Web Content Mining, Concept Space, Content Mining, Multilingual Text

This chapter presents a novel user-oriented, concept-based approach to multilingual web content mining using self-organizing maps. The linguistic knowledge required for multilingual web content mining is made available by encoding all multilingual concept-term relationships in a multilingual concept space. With this linguistic knowledge base, a concept-based multilingual text classifier is developed. It reveals the conceptual content of multilingual web documents and organizes them into concept categories on a concept-based browsing interface. To personalize multilingual web content mining, a concept-based user profile is generated from a user’s bookmark file to highlight the user’s topics of interest on the browsing interface. In this way, both explorative browsing and user-oriented, concept-focused information filtering on the multilingual web are facilitated.
