Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations

Author(s):  
Md. Rajib Hossain ◽  
Mohammed Moshiul Hoque

Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval and question answering. Investigating the embedding model helps to reduce the feature space and improves textual semantic as well as syntactic relations. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters, implemented on a Bengali corpus consisting of 180 million words. The performance of the embedding techniques is evaluated in both extrinsic and intrinsic ways. Extrinsic performance is evaluated by text classification, which achieved a maximum of 96.48% accuracy. Intrinsic performance is evaluated by word similarity (e.g., semantic, syntactic and relatedness) and analogy tasks. A maximum Pearson correlation (r̂) of 60.66% (Ss-r̂) is achieved for semantic similarities and 71.64% (Sy-r̂) for syntactic similarities, whereas relatedness obtained 79.80% (Rs-r̂). The semantic word analogy tasks achieved 44.00% accuracy, while the syntactic word analogy tasks obtained 36.00%.
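As a minimal sketch of how such an intrinsic word-similarity evaluation is commonly computed, the snippet below correlates cosine similarities from an embedding against human ratings using Pearson's r̂; the toy vectors, word pairs, and ratings are illustrative placeholders, not the paper's Bengali data.

```python
# Minimal sketch of an intrinsic word-similarity evaluation: cosine
# similarity between embedding vectors is correlated (Pearson's r) against
# human-annotated similarity ratings. The tiny in-memory "embeddings" and
# rated pairs are illustrative placeholders, not the paper's Bengali data.
import numpy as np
from scipy.stats import pearsonr

embeddings = {
    "river": np.array([0.9, 0.1, 0.3]),
    "water": np.array([0.8, 0.2, 0.4]),
    "stone": np.array([0.1, 0.9, 0.2]),
}

# (word1, word2, human rating) on a hypothetical 0-10 scale
rated_pairs = [
    ("river", "water", 8.5),
    ("river", "stone", 2.0),
    ("water", "stone", 1.5),
]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]

r_hat, _ = pearsonr(model_scores, human_scores)
print(f"Pearson correlation: {r_hat:.4f}")
```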

Author(s):  
Piotr Bojanowski ◽  
Edouard Grave ◽  
Armand Joulin ◽  
Tomas Mikolov

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
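A minimal sketch of this subword scheme, assuming the paper's 3-to-6 character n-gram range and angle-bracket boundary markers; the randomly initialized n-gram vectors stand in for parameters that skipgram training would actually learn.

```python
# Minimal sketch of the subword idea: a word is decomposed into character
# n-grams (with boundary markers) and its vector is the sum of the n-gram
# vectors. Random vectors stand in for trained skipgram parameters.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    token = f"<{word}>"                      # boundary markers, as in the paper
    grams = [token[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    return grams + [token]                   # the full word is kept as well

rng = np.random.default_rng(0)
dim = 8
ngram_vectors = {}                           # would be learned during training

def word_vector(word):
    vecs = [ngram_vectors.setdefault(g, rng.normal(size=dim))
            for g in char_ngrams(word)]
    return np.sum(vecs, axis=0)

# Out-of-vocabulary words still receive a vector from their shared n-grams.
print(word_vector("where")[:4])
```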


2009 ◽  
Vol 12 (7) ◽  
pp. 5-14
Author(s):  
Anh Hoang Tu Nguyen ◽  
Chi Tran Kim Nguyen ◽  
Phi Hong Nguyen

Text representation models are a very important pre-processing step in various domains such as text mining, information retrieval, and natural language processing. In this paper we summarize graph-based text representation models. A graph-based model can capture structural information such as the location, order, and proximity of term occurrences, which is discarded under the standard vector-based text representation models. We have tested this graph model in a Vietnamese text classification system.
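A minimal sketch of one common graph-of-words construction, assuming whitespace tokenization and a fixed co-occurrence window; the exact graph model tested in the paper may differ.

```python
# Minimal sketch of a graph-of-words representation: terms become nodes,
# and a directed edge links terms that co-occur within a fixed window,
# preserving the order and proximity that a bag-of-words model discards.
from collections import defaultdict

def graph_of_words(text, window=2):
    tokens = text.lower().split()
    edges = defaultdict(int)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            edges[(t, tokens[j])] += 1       # edge weight = co-occurrence count
    return edges

doc = "graph based models capture word order and word proximity"
for (u, v), w in graph_of_words(doc).items():
    print(f"{u} -> {v}  (weight {w})")
```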


2020 ◽  
Vol 16 (4) ◽  
pp. 1-16
Author(s):  
Pin Ni ◽  
Yuming Li ◽  
Victor Chang

Automatic keyword extraction and classification are important research directions in the domains of NLP (natural language processing), information retrieval, and text mining. As a fine-grained abstraction of text data, keywords are also its most important feature, with great practical and potential value in document classification, topic modeling, information retrieval, and other aspects. A compact representation of a document can be achieved through its keywords, which carry a large amount of significant information; this makes them quite advantageous for realizing text classification over a high-dimensional feature space. For this reason, this study designed a supervised keyword classification method based on TextRank automatic keyword extraction and optimized the model with a genetic algorithm, in order to model topic keywords for text classification.
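A minimal sketch of TextRank-style keyword extraction via PageRank on a word co-occurrence graph; stopword filtering, part-of-speech selection, and the paper's genetic-algorithm optimization step are omitted.

```python
# Minimal sketch of TextRank keyword extraction: build a word co-occurrence
# graph and rank nodes with PageRank; the top-ranked words serve as keyword
# candidates. Tokenization and the window size are simplified choices.
import networkx as nx

def textrank_keywords(text, window=2, top_k=5):
    tokens = [t.lower() for t in text.split()]
    g = nx.Graph()
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            g.add_edge(t, tokens[j])
    scores = nx.pagerank(g)                  # TextRank is PageRank on this graph
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

doc = ("keyword extraction and keyword classification are important tasks "
       "in information retrieval and text mining")
print(textrank_keywords(doc))
```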


Author(s):  
Zahra Mousavi ◽  
Heshaam Faili

Nowadays, wordnets are extensively used as a major resource in natural language processing and information retrieval tasks. Therefore, the accuracy of wordnets has a direct influence on the performance of the applications involved. This paper presents a fully-automated method for extending a previously developed Persian wordnet to cover more comprehensive and accurate verbal entries. First, by using a bilingual dictionary, some Persian verbs are linked to Princeton WordNet (PWN) synsets. A feature set related to the semantic behavior of compound verbs, which constitute the majority of Persian verbs, is proposed. This feature set is employed in a supervised classification system to select the proper links for inclusion in the wordnet. We also benefit from a pre-existing Persian wordnet, FarsNet, and a similarity-based method to produce a training set. The result is the largest automatically developed Persian wordnet, with more than 27,000 words, 28,000 PWN synsets and 67,000 word-sense pairs, substantially outperforming the previous Persian wordnet with about 16,000 words, 22,000 PWN synsets and 38,000 word-sense pairs.
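A minimal sketch of the supervised link-selection step, assuming hypothetical feature values (e.g., translation overlap, gloss similarity, FarsNet-based similarity) and placeholder labels; the paper's actual compound-verb feature set is not reproduced here.

```python
# Minimal sketch of supervised link selection: each candidate (verb, synset)
# link is represented by a feature vector, and a binary classifier decides
# whether to include it in the wordnet. All numbers below are hypothetical
# placeholders, not the paper's features or training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: candidate links; columns: e.g. translation overlap, gloss similarity,
# FarsNet-based similarity (placeholder values)
X_train = np.array([[0.9, 0.8, 0.7],
                    [0.2, 0.1, 0.3],
                    [0.7, 0.6, 0.9],
                    [0.1, 0.2, 0.1]])
y_train = np.array([1, 0, 1, 0])             # 1 = correct link, 0 = spurious

clf = LogisticRegression().fit(X_train, y_train)

candidate = np.array([[0.8, 0.7, 0.6]])      # a new (verb, synset) candidate
print("include link:", bool(clf.predict(candidate)[0]))
```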


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state of the art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.

