Natural Language Processing with DeepPavlov Library and Additional Semantic Features

Author(s):  
Oleg Sattarov


2021 ◽  
Vol 12 ◽  
Author(s):  
Changcheng Wu ◽  
Junyi Li ◽  
Ye Zhang ◽  
Chunmei Lan ◽  
Kaiji Zhou ◽  
...  

Nowadays, most courses on massive open online course (MOOC) platforms are xMOOCs, which are based on the traditional instruction-driven principle, and the course lecture remains the key component of the course. Analyzing the lectures of xMOOC instructors would therefore help to evaluate course quality and provide feedback to instructors and researchers. The current study aimed to portray the lecture styles of instructors in MOOCs from the perspective of natural language processing. Specifically, 129 course transcripts were downloaded from two major MOOC platforms. Two semantic analysis tools (Linguistic Inquiry and Word Count and Coh-Metrix) were used to extract semantic features including self-reference, tone, affect, cognitive words, cohesion, complex words, and sentence length. On the basis of student comments, course video review, and the results of cluster analysis, we found four distinct lecture styles: "perfect," "communicative," "balanced," and "serious." Significant differences were found between lecture styles within different disciplines for note taking, discussion posts, and overall course satisfaction. Future studies could use fine-grained log data to verify our results and explore how the results of natural language processing can be used to improve instructors' lectures in both MOOCs and traditional classes.
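The clustering step described above can be illustrated with a minimal sketch: standardize the per-transcript feature scores, then cluster them into four groups. The feature values below are synthetic stand-ins for the LIWC/Coh-Metrix scores; the choice of k-means is an assumption, as the abstract does not name the clustering algorithm.

```python
# Sketch of clustering transcripts by semantic features; synthetic data,
# and k-means is assumed (the study does not specify the algorithm).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((129, 7))  # 129 transcripts x 7 semantic features

# Standardize so no single feature dominates the distance metric,
# then partition into the four lecture-style clusters reported above.
X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # number of transcripts per style cluster
```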


2017 ◽  
Vol 44 (4) ◽  
pp. 526-551 ◽  
Author(s):  
Abdulgabbar Saif ◽  
Nazlia Omar ◽  
Mohd Juzaiddin Ab Aziz ◽  
Ummi Zakiah Zainodin ◽  
Naomie Salim

Wikipedia has become a high-coverage knowledge source used in many research areas such as natural language processing, text mining, and information retrieval. Several methods have been introduced for extracting explicit or implicit relations from Wikipedia to represent the semantics of concepts/words. However, the main challenge in semantic representation is how to incorporate different types of semantic relations to capture more semantic evidence about the associations of concepts. In this article, we propose a semantic concept model that incorporates different types of semantic features extracted from Wikipedia. For each concept that corresponds to an article, four semantic features are introduced: template links, categories, salient concepts, and topics. The proposed model is based on probability distributions defined over these semantic features of a Wikipedia concept. The template links and categories are document-level features, extracted directly from the structured information included in the article. The salient concepts and topics, on the other hand, are corpus-level features, extracted to capture implicit relations among concepts. For the salient concepts feature, a distributional method is applied to the hypertext corpus to extract this feature for each Wikipedia concept; the probability product kernel is then used to improve the weight of each concept in this feature. For the topic feature, Labelled latent Dirichlet allocation is adapted to the supervised multi-label structure of Wikipedia to train the probabilistic model of this feature. Finally, linear interpolation is used to incorporate these semantic features into the probabilistic model, estimating the semantic relation probability of a specific concept over Wikipedia articles. The proposed model is evaluated on 12 benchmark datasets in three natural language processing tasks: measuring the semantic relatedness of concepts/words in general and in the biomedical domain, measuring semantic textual relatedness, and measuring the semantic compositionality of noun compounds. The model is also compared with five methods that depend on separate semantic features in Wikipedia. Experimental results show that the proposed model achieves promising results in the three tasks and outperforms the baseline methods on most of the evaluation datasets. This implies that incorporating explicit and implicit semantic features is useful for representing the semantics of concepts in Wikipedia.
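The final interpolation step admits a compact illustration: each of the four features contributes its own probability distribution over related concepts, and the combined estimate is a weighted sum. The distributions, concept names, and uniform weights below are toy values, not figures from the article.

```python
# Sketch of linear interpolation over per-feature distributions;
# all probabilities and weights here are illustrative toy values.
def interpolate(distributions, weights):
    """Combine per-feature probability distributions over concepts."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 1
    concepts = set().union(*distributions)
    return {c: sum(w * d.get(c, 0.0) for w, d in zip(weights, distributions))
            for c in concepts}

p_template = {"physics": 0.6, "energy": 0.4}     # template links
p_category = {"physics": 0.5, "mechanics": 0.5}  # categories
p_salient  = {"energy": 0.7, "mechanics": 0.3}   # salient concepts
p_topic    = {"physics": 0.8, "energy": 0.2}     # topics

print(interpolate([p_template, p_category, p_salient, p_topic],
                  [0.25, 0.25, 0.25, 0.25]))
```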


2012 ◽  
Vol 5s1 ◽  
pp. BII.S8960 ◽  
Author(s):  
Bart Desmet ◽  
Véronique Hoste

This paper describes a system for automatic emotion classification, developed for the 2011 i2b2 Natural Language Processing Challenge, Track 2. The objective of the shared task was to label suicide notes with 15 relevant emotions at the sentence level. Our system uses 15 SVM models (one for each emotion), each trained on the combination of features found to perform best for that emotion. Features included lemma and trigram bag-of-words features, as well as information from semantic resources such as WordNet, SentiWordNet, and subjectivity clues. The best-performing system labeled 7 of the 15 emotions and achieved an F-score of 53.31% on the test data.
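A minimal sketch of the per-emotion setup is given below: one binary linear SVM per label over bag-of-words features spanning unigrams through trigrams. The toy sentences and the "guilt" label are invented for illustration; the real system also tuned a separate feature combination per emotion and added WordNet/SentiWordNet/subjectivity-clue features.

```python
# Sketch of one of the 15 per-emotion SVM classifiers; toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sentences = ["I cannot go on like this", "Please forgive me",
             "Thank you for everything", "I am so sorry for the pain"]
guilt = [0, 1, 0, 1]  # hypothetical binary labels for one emotion

vec = CountVectorizer(ngram_range=(1, 3))  # unigrams through trigrams
X = vec.fit_transform(sentences)
clf = LinearSVC().fit(X, guilt)
print(clf.predict(vec.transform(["forgive me for this"])))
```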


2020 ◽  
Author(s):  
M Krishna Siva Prasad ◽  
Poonam Sharma

Short text or sentence similarity is crucial in various natural language processing activities. Traditional measures for sentence similarity consider word order, semantic features, and role annotations of text to derive the similarity, but these measures do not suit short texts or sentences with negation. Hence, this paper proposes an approach to determine the semantic similarity of sentences and also presents an algorithm to handle negation. Because word-pair similarity plays a significant role in sentence similarity, this paper also discusses the similarity between word pairs. Existing semantic similarity measures do not handle antonyms accurately, so this paper proposes an algorithm to handle antonyms as well. The paper further presents an antonym dataset of 111 word pairs with corresponding expert ratings, on which the existing semantic similarity measures are tested. The correlation results show that the expert ratings are consistent with the scores obtained from the semantic similarity measures. Sentence similarity is handled by two proposed algorithms: the first deals with typical sentences, and the second deals with contradiction in the sentences. The SICK dataset, which contains sentences with negation, is used for evaluating sentence similarity. The proposed algorithms helped improve the results of sentence similarity.
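One way to make antonym handling concrete is the sketch below: check WordNet's antonym relation before falling back to a path-based similarity. It assumes NLTK's WordNet corpus is installed (nltk.download('wordnet')); the zero score for antonym pairs and the max-over-synsets fallback are illustrative choices, not the paper's actual algorithm or its 111-pair dataset.

```python
# Sketch of antonym-aware word-pair similarity using NLTK's WordNet;
# the scoring scheme is an assumption, not the paper's method.
from nltk.corpus import wordnet as wn

def are_antonyms(w1, w2):
    for syn in wn.synsets(w1):
        for lemma in syn.lemmas():
            if any(a.name() == w2 for a in lemma.antonyms()):
                return True
    return False

def word_similarity(w1, w2):
    if are_antonyms(w1, w2):
        return 0.0  # force antonym pairs to the bottom of the scale
    sims = [s1.path_similarity(s2) or 0.0
            for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(sims, default=0.0)

print(word_similarity("hot", "cold"), word_similarity("hot", "warm"))
```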


Author(s):  
Oleg Kuzmin ◽  

The modern world is moving towards global digitalization and accelerated software development, with a clear tendency to replace human work with digital services or programs that imitate the performance of similar tasks. There is no doubt that, in the long term, the use of such technologies has economic benefits for enterprises and companies. Despite this, the quality of the final result is often less than satisfactory, and machine translation systems are no exception: editing texts translated by online translation services is still a demanding task. At the moment, producing high-quality translations using only machine translation systems remains impossible for multiple reasons, the main one being the complexity of natural language itself: the existence of sublanguages, abstract words, polysemy, etc. Since improving the quality of machine translation systems is one of the priorities of natural language processing (NLP), this article describes current trends in developing modern machine translation systems as well as the latest advances in NLP, and gives suggestions for software innovations that would minimize the number of errors. Even though recent years have seen a significant breakthrough in the speed of information analysis, this will, in all probability, not be a priority issue in the future. The main criteria for evaluating the quality of translated texts will be their semantic coherence and the semantic accuracy of the lexical material used. To improve machine translation systems, we should introduce elements of data differentiation and personalization of information for individual users and their tasks, employing topic modeling to determine the subject area of a particular text. Algorithms based on deep learning are currently able to perform these tasks; however, the process of identifying unique lexical units requires a more detailed linguistic description of their semantic features. The parsing methods used in analyzing texts should also provide for the possibility of clustering by sublanguages. Creating automated electronic dictionaries for specific fields of professional knowledge will help improve the quality of machine translation systems. Notably, to date there have been no successful projects creating dictionaries for machine translation systems for specific sublanguages. Thus, there is a need to develop such dictionaries and to integrate them into existing online translation systems.
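The topic-modeling idea mentioned above can be sketched as follows: infer a text's dominant topic with LDA and use that index to select a matching domain dictionary. The corpus, the number of topics, and the routing step are all illustrative assumptions, not components described in the article.

```python
# Sketch of determining a text's subject area via LDA so that a
# domain-specific dictionary could be selected; toy corpus and topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the court ruled on the contract dispute",
        "the patient received a dose of the drug",
        "the engine torque exceeds the rated load"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

new_text = ["the judge reviewed the contract"]
topic = lda.transform(vec.transform(new_text)).argmax()
print(f"dominant topic: {topic}")  # index used to pick a domain dictionary
```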


2020 ◽  
pp. 3-17
Author(s):  
Peter Nabende

Natural language processing for under-resourced languages is now a mainstream research area. However, there are limited studies on natural language processing applications for many indigenous East African languages. As a contribution to closing this gap, this paper focuses on evaluating the application of well-established machine translation methods to one heavily under-resourced indigenous East African language, Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. We then apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English, using a very limited dataset of parallel sentences. Automatic evaluation results show that a transformer-based neural machine translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source-language input.
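The automatic-evaluation step can be illustrated with a minimal BLEU computation using the sacrebleu package; the hypothesis and reference strings below are placeholders, not data from the study, and sacrebleu itself is an assumed tool (the paper does not name its BLEU implementation).

```python
# Sketch of corpus-level BLEU scoring with sacrebleu; placeholder strings.
import sacrebleu

hypotheses = ["the children went to the well"]          # system outputs
references = [["the children walked to the well"]]      # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```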


Diabetes ◽  
2019 ◽  
Vol 68 (Supplement 1) ◽  
pp. 1243-P
Author(s):  
JIANMIN WU ◽  
FRITHA J. MORRISON ◽  
ZHENXIANG ZHAO ◽  
XUANYAO HE ◽  
MARIA SHUBINA ◽  
...  
