Semantic concept model using Wikipedia semantic features

2017 ◽  
Vol 44 (4) ◽  
pp. 526-551 ◽  
Author(s):  
Abdulgabbar Saif ◽  
Nazlia Omar ◽  
Mohd Juzaiddin Ab Aziz ◽  
Ummi Zakiah Zainodin ◽  
Naomie Salim

Wikipedia has become a high-coverage knowledge source that has been used in many research areas, such as natural language processing, text mining and information retrieval. Several methods have been introduced for extracting explicit or implicit relations from Wikipedia to represent the semantics of concepts/words. However, the main challenge in semantic representation is how to incorporate different types of semantic relations to capture more semantic evidence of the associations between concepts. In this article, we propose a semantic concept model that incorporates different types of semantic features extracted from Wikipedia. For each concept that corresponds to an article, four semantic features are introduced: template links, categories, salient concepts and topics. The proposed model is based on the probability distributions defined over these semantic features of a Wikipedia concept. Template links and categories are document-level features that are extracted directly from the structured information included in the article. On the other hand, salient concepts and topics are corpus-level features that are extracted to capture implicit relations among concepts. For the salient-concept feature, a distributional method is applied to the hypertext corpus to extract this feature for each Wikipedia concept; the probability product kernel is then used to improve the weight of each concept in this feature. For the topic feature, labelled latent Dirichlet allocation is adapted to the supervised multi-label structure of Wikipedia to train the probabilistic model of this feature. Finally, we use linear interpolation to incorporate these semantic features into the probabilistic model and estimate the semantic relation probability of a specific concept over Wikipedia articles. The proposed model is evaluated on 12 benchmark datasets in three natural language processing tasks: measuring the semantic relatedness of concepts/words in general and in the biomedical domain, measuring semantic textual relatedness and measuring the semantic compositionality of noun compounds. The model is also compared with five methods that depend on separate semantic features in Wikipedia. Experimental results show that the proposed model achieves promising results in the three tasks and outperforms the baseline methods on most of the evaluation datasets, which implies that incorporating explicit and implicit semantic features is useful for representing the semantics of concepts in Wikipedia.
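
As a rough illustration of the final combination step described above, a linear interpolation of the four feature distributions would take the standard form below; the notation and the lambda weights are illustrative and not taken from the article itself.

```latex
P(c \mid a) = \lambda_{1}\, P_{\mathrm{tpl}}(c \mid a) + \lambda_{2}\, P_{\mathrm{cat}}(c \mid a)
            + \lambda_{3}\, P_{\mathrm{sal}}(c \mid a) + \lambda_{4}\, P_{\mathrm{top}}(c \mid a),
\qquad \sum_{k=1}^{4} \lambda_{k} = 1, \quad \lambda_{k} \ge 0
```

Here a denotes the Wikipedia article (concept) being modelled, c a candidate related concept, and the four components correspond to the template-link, category, salient-concept and topic distributions.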

Author(s):  
Santosh Kumar Mishra ◽  
Rijul Dhir ◽  
Sriparna Saha ◽  
Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image that aims to describe its salient parts. It is an important problem, as it involves both computer vision and natural language processing, where computer vision is used for understanding images and natural language processing is used for language modeling. A great deal of work has been done on image captioning for the English language. In this article, we have developed a model for image captioning in the Hindi language. Hindi is the official language of India and the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating the well-known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in the Hindi language. These attention mechanisms are new for the Hindi language, as they have not previously been used for it. The obtained results of the proposed model are compared with several baselines in terms of BLEU scores, and the results show that our model performs better than the others. A manual evaluation of the obtained captions in terms of adequacy and fluency also reveals the effectiveness of our proposed approach. Availability of resources: The code for this article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language ; the dataset will be made available at http://www.iitp.ac.in/∼ai-nlp-ml/resources.html .
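
As an indication of what an attention step in such a captioning decoder looks like, the sketch below implements additive (Bahdanau-style) attention over CNN region features in PyTorch; the layer sizes, and PyTorch itself, are illustrative assumptions and do not reproduce the article's exact architectures.

```python
# Minimal sketch of additive attention for image captioning (illustrative only).
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, attn_dim):
        super().__init__()
        self.enc_proj = nn.Linear(feat_dim, attn_dim)   # project CNN region features
        self.dec_proj = nn.Linear(hid_dim, attn_dim)    # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim); hidden: (batch, hid_dim)
        e = self.score(torch.tanh(self.enc_proj(feats) + self.dec_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                 # attention weights over image regions
        context = (alpha * feats).sum(dim=1)            # weighted image context for the next word
        return context, alpha.squeeze(-1)
```

At each decoding step, the context vector is concatenated with the previous word embedding before the decoder predicts the next Hindi word.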


2020 ◽  
pp. 016555152096278
Author(s):  
Rouzbeh Ghasemi ◽  
Seyed Arad Ashrafi Asli ◽  
Saeedeh Momtazi

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is a challenging issue in many low-resource languages. This problem results in a significant gap between the accuracy of the natural language processing tools available for low-resource languages and those for rich languages. To address this problem for the sentiment analysis task in the Persian language, we propose a cross-lingual deep learning framework that benefits from the available English training data. We deploy cross-lingual embeddings to cast sentiment analysis as a transfer learning problem in which a model is transferred from a rich-resource language to low-resource ones. Our model is flexible enough to use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on the English Amazon dataset and the Persian Digikala dataset, using two different embedding models and four different classification networks, show the superiority of the proposed model compared with state-of-the-art monolingual techniques. Based on our experiments, the performance of Persian sentiment analysis improves by 22% with static embeddings and by 9% with dynamic embeddings. Our proposed model is general and language-independent; that is, it can be used for any low-resource language once a cross-lingual embedding is available for the source–target language pair. Moreover, by benefitting from word-aligned cross-lingual embeddings, the only data required for a reliable cross-lingual embedding is a bilingual dictionary, which is available between almost all languages and English as a potential source language.
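
A minimal sketch of the transfer setup, assuming word-aligned cross-lingual embeddings in a plain text format and an averaging sentence encoder with a logistic-regression classifier (illustrative choices, not the article's exact pipeline): a classifier is trained on English reviews and applied unchanged to Persian reviews embedded in the same space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_vectors(path, dim=300):
    # Load word vectors from a whitespace-separated text file: "word v1 v2 ... vd" per line.
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= 2:          # skip a possible header line
                continue
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def embed(text, vecs, dim=300):
    # Average the cross-lingual vectors of the words in a review.
    words = [vecs[w] for w in text.split() if w in vecs]
    return np.mean(words, axis=0) if words else np.zeros(dim, dtype=np.float32)

# Aligned embeddings for the two languages (illustrative file names).
en_vecs = load_vectors("wiki.en.aligned.vec")
fa_vecs = load_vectors("wiki.fa.aligned.vec")

# english_reviews/english_labels and persian_reviews are placeholders for the
# Amazon training data and the Digikala test data mentioned in the abstract.
X_train = np.stack([embed(t, en_vecs) for t in english_reviews])
clf = LogisticRegression(max_iter=1000).fit(X_train, english_labels)

X_test = np.stack([embed(t, fa_vecs) for t in persian_reviews])
persian_predictions = clf.predict(X_test)   # zero-shot transfer to Persian
```

The same pattern applies when the classifier is a deep network: only the shared embedding space makes the English-trained weights meaningful for Persian input.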


Author(s):  
Maitri Patel and Dr Hemant D Vasava

Data, information and knowledge move and grow rapidly in today's world, and almost any kind of information can be found on the Internet. This is very useful, including for the academic world, but plagiarism is also widely practised alongside it. Plagiarism degrades the originality of work: fraudulently using someone's original work and later failing to acknowledge them is becoming common, and teachers or professors sometimes cannot identify the plagiarised material they are given. Higher-education systems therefore use different types of comparison tools. Our idea is to match a number of different documents, such as student assignments, against each other to find out whether students have copied each other's work, and also to compare an ideal answer sheet for a particular subject examination against the students' test sheets. The idea is to compare documents and rank them on the basis of similarity. Both approaches are of one kind: comparing documents. Many methods are already in use for identifying plagiarism, so we can compare them and develop them further if needed.
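
A minimal sketch of the document-comparison idea, assuming TF-IDF vectors and cosine similarity from scikit-learn (one common choice among the many existing methods referred to above): every pair of submissions is scored and the pairs are ranked by similarity.

```python
# Rank document pairs by TF-IDF cosine similarity; the toy documents are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Sorting algorithms order the elements of a list.",
    "A sorting algorithm orders the elements of a list.",   # near-duplicate of document 0
    "Graphs model pairwise relations between objects.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
similarity = cosine_similarity(tfidf)

# Report pairs ranked from most to least similar; high scores flag possible copying.
pairs = [(i, j, similarity[i, j])
         for i in range(len(documents)) for j in range(i + 1, len(documents))]
for i, j, score in sorted(pairs, key=lambda p: p[2], reverse=True):
    print(f"doc {i} vs doc {j}: similarity {score:.2f}")
```

The same scoring works for the second use case by comparing each student answer sheet against the ideal answer sheet instead of against the other submissions.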


2020 ◽  
Vol 34 (02) ◽  
pp. 1741-1748 ◽  
Author(s):  
Meng-Hsuan Yu ◽  
Juntao Li ◽  
Danyang Liu ◽  
Dongyan Zhao ◽  
Rui Yan ◽  
...  

Automatic storytelling has consistently been a challenging area in the field of natural language processing. Although considerable achievements have been made, the gap between automatically generated stories and human-written stories is still significant. Moreover, the limitations of existing automatic storytelling methods are obvious, e.g., in the consistency of the content and the diversity of wording. In this paper, we propose a multi-pass hierarchical conditional variational autoencoder model to overcome these challenges and limitations. While the conditional variational autoencoder (CVAE) is employed to generate diversified content, the hierarchical structure and the multi-pass editing scheme allow the model to generate more consistent content. We conduct extensive experiments on the ROCStories dataset. The results verify the validity and effectiveness of our proposed model and show substantial improvement over existing state-of-the-art approaches.
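
For context, the objective a conditional variational autoencoder maximizes is the standard conditional evidence lower bound shown below; the notation is generic and not copied from the paper.

```latex
\mathcal{L}(\theta, \phi; x, c) =
\mathbb{E}_{q_{\phi}(z \mid x, c)}\!\left[\log p_{\theta}(x \mid z, c)\right]
- \mathrm{KL}\!\left(q_{\phi}(z \mid x, c) \,\|\, p_{\theta}(z \mid c)\right)
```

Here x is the generated story text, c the conditioning context (such as the title or an earlier editing pass), z the latent variable, and q and p the recognition and prior/decoder networks; the hierarchical, multi-pass design applies this objective at more than one level of the story.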


2018 ◽  
Vol 12 (02) ◽  
pp. 237-260
Author(s):  
Weifeng Xu ◽  
Dianxiang Xu ◽  
Abdulrahman Alatawi ◽  
Omar El Ariss ◽  
Yunkai Liu

The unigram is a fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical properties regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. We describe a probabilistic model that relies on these properties to solve a well-known problem in source code analysis: how to expand a given abbreviation to its originally intended word. Our empirical study shows that using the unigrams extracted from the source code repository outperforms using a natural language corpus by 21% when solving this domain-specific problem.
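
The sketch below illustrates the kind of expansion described above, under two simplifying assumptions that are ours rather than the paper's: candidates are words that contain the abbreviation as a subsequence and share its first letter, and the candidate with the highest corpus unigram probability wins.

```python
# Expand an identifier abbreviation using unigram probabilities (illustrative counts).
def matches(abbr, word):
    # True if abbr is a subsequence of word and they share the first letter.
    if not word.startswith(abbr[0]):
        return False
    it = iter(word)
    return all(ch in it for ch in abbr)

unigram_counts = {"message": 90_000, "manager": 120_000, "merge": 40_000, "msg": 15_000}
total = sum(unigram_counts.values())

def expand(abbr):
    candidates = {w: c / total for w, c in unigram_counts.items() if matches(abbr, w)}
    return max(candidates, key=candidates.get) if candidates else abbr

print(expand("msg"))   # "message" under these illustrative counts
```

In the paper's setting, the counts would come from the 1.01 billion source-code unigrams rather than the tiny table used here.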


2004 ◽  
Vol 9 (1) ◽  
pp. 53-68 ◽  
Author(s):  
Montserrat Arévalo Rodríguez ◽  
Montserrat Civit Torruella ◽  
Maria Antònia Martí

In the field of corpus linguistics, Named Entity treatment includes the recognition and classification of different types of discursive elements such as proper names, dates, times, etc. These discursive elements play an important role in various Natural Language Processing applications and techniques, such as Information Retrieval, Information Extraction, translation memories and document routers.


2021 ◽  
Vol 12 ◽  
Author(s):  
Changcheng Wu ◽  
Junyi Li ◽  
Ye Zhang ◽  
Chunmei Lan ◽  
Kaiji Zhou ◽  
...  

Nowadays, most courses on massive open online course (MOOC) platforms are xMOOCs, which are based on the traditional instruction-driven principle, and the lecture is still the key component of the course. Thus, analyzing the lectures of xMOOC instructors would be helpful for evaluating course quality and providing feedback to instructors and researchers. The current study aimed to portray the lecture styles of instructors in MOOCs from the perspective of natural language processing. Specifically, 129 course transcripts were downloaded from two major MOOC platforms. Two semantic analysis tools (Linguistic Inquiry and Word Count, and Coh-Metrix) were used to extract semantic features including self-reference, tone, affect, cognitive words, cohesion, complex words, and sentence length. On the basis of students' comments, course video review, and the results of a cluster analysis, we found four different lecture styles: “perfect,” “communicative,” “balanced,” and “serious.” Significant differences were found between the lecture styles within different disciplines for note taking, discussion posts, and overall course satisfaction. Future studies could use fine-grained log data to verify our results and explore how the results of natural language processing can be used to improve the lectures of instructors in both MOOCs and traditional classes.
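
A minimal sketch of the feature-based clustering step, assuming scikit-learn with k-means and k = 4 (the number of styles reported); the feature values below are invented for illustration, not taken from the 129 transcripts.

```python
# Cluster course transcripts by their language features to discover lecture styles.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# rows: transcripts; columns: [self_reference, tone, cohesion, avg_sentence_length]
features = np.array([
    [2.1, 65.0, 0.42, 18.3],
    [0.8, 40.2, 0.55, 25.1],
    [1.5, 72.4, 0.48, 15.9],
    [0.4, 35.0, 0.60, 28.7],
])

X = StandardScaler().fit_transform(features)        # put features on a comparable scale
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster index per transcript; clusters are then named by inspection
```

In the study, the clusters are interpreted against student comments and video review before being labelled as the four styles.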


2021 ◽  
Author(s):  
Yoojoong Kim ◽  
Jeong Moon Lee ◽  
Moon Joung Jang ◽  
Yun Jin Yum ◽  
Jong-Ho Kim ◽  
...  

BACKGROUND: With advances in deep learning and natural language processing, analyzing medical texts is becoming increasingly important. Nonetheless, despite the importance of medical texts, a study on medical-specific language models has not yet been conducted.
OBJECTIVE: Korean medical text is highly difficult to analyze because of the agglutinative characteristics of the language as well as the complex terminology of the medical domain. To solve this problem, we collected a Korean medical corpus and used it to train language models.
METHODS: In this paper, we present a Korean medical language model based on deep-learning natural language processing. The proposed model was trained with the BERT pre-training framework on the medical context, starting from a state-of-the-art Korean language model.
RESULTS: After pre-training, the proposed method showed increased accuracies of 0.147 and 0.148 for the masked language model with next-sentence prediction. In the intrinsic evaluation, the next-sentence prediction accuracy improved by 0.258, which is a remarkable enhancement. In addition, the extrinsic evaluation on Korean medical semantic textual similarity data showed a 0.046 increase in the Pearson correlation.
CONCLUSIONS: The results demonstrate the superiority of the proposed model for Korean medical natural language processing. We expect that our proposed model can be extended for application to various languages and domains.
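
A minimal sketch of domain-adaptive pre-training with the Hugging Face transformers library; the checkpoint name, corpus file and hyperparameters are illustrative assumptions, and the sketch covers only the masked-language-model part, not the next-sentence-prediction objective the paper also uses.

```python
# Continue masked-language-model pre-training of a general Korean BERT on a medical corpus.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "klue/bert-base"                        # assumed general-domain Korean BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

corpus = load_dataset("text", data_files={"train": "korean_medical_corpus.txt"})
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                       batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="kr-medical-bert", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The adapted checkpoint can then be fine-tuned on downstream tasks such as the medical semantic textual similarity evaluation mentioned above.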


2020 ◽  
Vol 10 (11) ◽  
pp. 3740
Author(s):  
Hongjin Kim ◽  
Harksoo Kim

In well-spaced Korean sentences, morphological analysis is the first step in natural language processing, in which a Korean sentence is segmented into a sequence of morphemes and the parts of speech of the segmented morphemes are determined. Named entity recognition is a natural language processing task carried out to obtain morpheme sequences with specific meanings, such as person, location, and organization names. Although morphological analysis and named entity recognition are closely associated with each other, they have been independently studied and have exhibited the inevitable error propagation problem. Hence, we propose an integrated model based on label attention networks that simultaneously performs morphological analysis and named entity recognition. The proposed model comprises two layers of neural network models that are closely associated with each other. The lower layer performs a morphological analysis, whereas the upper layer performs a named entity recognition. In our experiments using a public gold-labeled dataset, the proposed model outperformed previous state-of-the-art models used for morphological analysis and named entity recognition. Furthermore, the results indicated that the integrated architecture could alleviate the error propagation problem.
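
The sketch below shows one way the two-layer idea could be wired up in PyTorch: a lower encoder predicts morpheme/POS labels, an expected label embedding computed by attending over that label distribution is fed to an upper encoder, and the upper encoder predicts named-entity labels. The layer types and sizes are illustrative and do not reproduce the paper's exact label attention networks.

```python
# Joint morphological analysis and NER with label attention (illustrative sketch).
import torch
import torch.nn as nn

class JointTagger(nn.Module):
    def __init__(self, vocab, n_pos, n_ner, emb=128, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lower = nn.LSTM(emb, hid // 2, bidirectional=True, batch_first=True)
        self.pos_out = nn.Linear(hid, n_pos)          # morpheme/POS label scores
        self.pos_emb = nn.Embedding(n_pos, hid)       # label embeddings for attention
        self.upper = nn.LSTM(hid * 2, hid // 2, bidirectional=True, batch_first=True)
        self.ner_out = nn.Linear(hid, n_ner)          # named-entity label scores

    def forward(self, tokens):
        h1, _ = self.lower(self.embed(tokens))        # (batch, seq, hid)
        pos_logits = self.pos_out(h1)
        attn = torch.softmax(pos_logits, dim=-1)      # attention over POS labels
        label_vec = attn @ self.pos_emb.weight        # expected label embedding per token
        h2, _ = self.upper(torch.cat([h1, label_vec], dim=-1))
        return pos_logits, self.ner_out(h2)
```

Because the NER layer sees a soft distribution over morphological labels rather than a single hard prediction, errors in the lower layer propagate less sharply, which is the motivation for the integrated architecture.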

