Part-of-speech (POS) tagging is one of the challenging research areas in natural language processing (NLP). It requires good knowledge of the target language and large amounts of data, or corpora, for feature engineering, both of which contribute to good tagger performance. Our main contribution in this work is a Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the corpus, each word is tagged manually using a tagset we designed. We experiment on the corpus with deep learning methods and present POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM. We anticipate the main challenges encountered in understanding and handling natural language from a computational linguistics perspective. In designing the corpus, we have tried to resolve the ambiguity of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. In initial experiments on the first sets of data, around 41,000 tokens, the taggers already yielded considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy rose and the analyses became more pertinent. The taggers achieve an accuracy of 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based method with LSTM. To situate this work within NLP research on Khasi, we also review recent POS taggers and other NLP work on the Khasi language for comparison.
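The corpus figures and accuracy scores quoted above are token-level quantities. As a minimal sketch of how such figures could be computed, the following assumes the corpus is stored as a list of sentences, each a list of (word, tag) pairs. The function names, the placeholder words `w1` to `w4`, and the tags `NN` and `VB` are illustrative assumptions, not the Khasi tagset or data from this work.

```python
from collections import Counter


def corpus_stats(tagged_sentences):
    """Return (token count, distinct word count, tag frequencies) for a tagged corpus."""
    tokens = [word for sent in tagged_sentences for word, _ in sent]
    tag_counts = Counter(tag for sent in tagged_sentences for _, tag in sent)
    return len(tokens), len(set(tokens)), tag_counts


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the manually assigned tag."""
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must align token by token")
        for gold_tag, pred_tag in zip(gold_sent, pred_sent):
            total += 1
            correct += int(gold_tag == pred_tag)
    return correct / total if total else 0.0


# Placeholder corpus: the words and tags are stand-ins, not real Khasi data.
toy_corpus = [
    [("w1", "NN"), ("w2", "VB"), ("w3", "NN")],
    [("w1", "NN"), ("w4", "VB")],
]
n_tokens, n_types, tag_counts = corpus_stats(toy_corpus)

gold_tags = [[tag for _, tag in sent] for sent in toy_corpus]
predicted_tags = [["NN", "VB", "VB"], ["NN", "VB"]]  # hypothetical tagger output
accuracy = token_accuracy(gold_tags, predicted_tags)
```

Any of the three taggers could be scored this way, provided its output is aligned token by token with the manual annotation.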
Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to identify the writer’s feelings, expressed as positive or negative, by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given piece of content. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out for the English language. However, Bengali sentiment expression has varying degrees of sentiment, which can plausibly differ from those of English. It is therefore important that sentiment assessment for Bengali be developed and carried out properly. In sentiment analysis, the predictive power of an automatic model depends entirely on the quality of dataset annotation. Bengali sentiment annotation is challenging because of the language's diverse syntactic structures and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). In this article, we therefore propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences accurately, with a view to building effective datasets for automatic sentiment prediction.
This work addresses Sri Aurobindo’s mantric poem, Savitri, with a computational linguistics approach. Savitri is one of the longest poems ever written in English. We build the connectivity matrix between all main word pairs and analyse its structure. Concepts emerge as directions that better explain the variance of the data in the hyperspace of words. When projected onto the low-dimensional space of concepts, the vector of attention as the reader moves through the text shows a large correlation across sections of the poem, thus acting the future and the past over again. These findings suggest that the mathematical structure of Savitri constitutes and reflects a substrate for the author’s main ideas, facilitating the reader’s understanding of the poem’s meaning via its long-range dynamical correlations. Acknowledging an irreducible essence to poetry, future studies on the relationship between words and sounds, and between sounds and ideas, may provide invaluable hints about the origin of language and its intimate relationship with the evolution of human consciousness.
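The idea of concepts as directions of high variance in word space can be illustrated with a small sketch. It rests on assumptions the abstract does not spell out: connectivity is taken to mean that two words occur in the same section, and the leading direction is found by power iteration on the covariance of the matrix rows (a principal component analysis reduced to its first component). The function names and the placeholder words `a` to `d` are hypothetical.

```python
import math
from itertools import combinations


def cooccurrence_matrix(sections, vocab):
    """Count, for each word pair, the number of sections in which both words occur."""
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)
    matrix = [[0.0] * n for _ in range(n)]
    for words in sections:
        present = sorted({index[w] for w in words if w in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1.0
            matrix[j][i] += 1.0
    return matrix


def leading_direction(matrix, iterations=200):
    """Approximate the direction of maximal variance of the rows by power iteration."""
    n = len(matrix)
    means = [sum(row[k] for row in matrix) / n for k in range(n)]
    centered = [[row[k] - means[k] for k in range(n)] for row in matrix]
    # Covariance of the centered rows: C = X^T X / n.
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(n)]
           for a in range(n)]
    # Non-uniform start vector, normalised to unit length.
    start = [float(i + 1) for i in range(n)]
    start_norm = math.sqrt(sum(x * x for x in start))
    vec = [x / start_norm for x in start]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance along the current vector; stop early
            break
        vec = [x / norm for x in nxt]
    return vec


# Placeholder sections and vocabulary, not text from the poem.
sections = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["a", "b", "d"]]
vocab = ["a", "b", "c", "d"]
matrix = cooccurrence_matrix(sections, vocab)
direction = leading_direction(matrix)
```

Projecting each section's word counts onto a few such directions would give the low-dimensional trajectory whose correlations across sections the abstract describes. The sign of `direction` is arbitrary, as with any principal component.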
Analyzing statements of facts and claims in online discourse is the subject of a multitude of research areas. Methods from natural language processing and computational linguistics help to investigate issues such as the spread of biased narratives and falsehoods on the Web. Related tasks include fact-checking, stance detection and argumentation mining. Knowledge-based approaches, in particular works in knowledge base construction and augmentation, are concerned with mining, verifying and representing factual knowledge. While all these fields are concerned with strongly related notions, such as claims, facts and evidence, the terminology and conceptualisations used across and within communities vary heavily. This makes it hard to assess the commonalities and relations of related works, and how research in one field may help address problems in another. We survey the state of the art from a range of fields in this interdisciplinary area across a range of research tasks. We assess varying definitions and propose a conceptual model – Open Claims – for claims and related notions that takes into consideration their inherent complexity, distinguishing between their meaning, linguistic representation and context. We also introduce an implementation of this model that uses established vocabularies, and discuss applications across various tasks related to online discourse analysis.
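The three-way distinction between a claim's meaning, its linguistic representation and its context can be sketched as a small data structure. This is not the Open Claims implementation, which the abstract says is built on established vocabularies. Every class and field name below is a hypothetical stand-in chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """The meaning of a claim, independent of how it is worded."""
    identifier: str
    description: str


@dataclass(frozen=True)
class Context:
    """Who made a claim, where, and when (illustrative fields only)."""
    speaker: str
    source: str
    date: str


@dataclass(frozen=True)
class Utterance:
    """One linguistic representation of a proposition, made in a given context."""
    text: str
    proposition: Proposition
    context: Context


def group_by_meaning(utterances):
    """Collect utterances that express the same proposition."""
    grouped = defaultdict(list)
    for utterance in utterances:
        grouped[utterance.proposition.identifier].append(utterance)
    return dict(grouped)


# Placeholder values: no real claims, speakers or sources are represented here.
meaning = Proposition("p1", "placeholder meaning")
first = Utterance("wording A", meaning, Context("speaker-1", "forum", "2020-01-01"))
second = Utterance("wording B", meaning, Context("speaker-2", "news portal", "2020-01-02"))
grouped = group_by_meaning([first, second])
```

Keeping the proposition separate from its utterances lets two differently worded statements be recognised as the same claim while their contexts stay distinct.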
Machine Learning (ML) and Artificial Intelligence (AI) methods are transforming many commercial and academic areas, including feature extraction, autonomous driving, computational linguistics, and voice recognition. These new technologies are now having a significant effect in radiography, forensics, and many other areas where the accessibility of automated systems may improve the precision and repeatability of essential job performance. In this systematic review, we begin by providing a short overview of the different methods that are currently being developed, with a particular emphasis on those utilized in biomedical studies.
With the rapid increase in the number of digital texts available in schools, new methodological approaches to studying writing development in education are now emerging. However, new methodological approaches bring new epistemological challenges. In this article, I examine some of these challenges and discuss how they affect the role of computational linguistics within the field of educational writing research. The article is structured around three main sections. First, I position computational linguistics within the wider field of educational writing research, with particular focus on L1 writing and K12 education. Second, I discuss to what extent methods from computational linguistics can provide us with new insights into different aspects of educational writing. Third, I discuss the potential of the concept of affordance to bridge technology-centered and human-centered methodological approaches, and I relate this idea to recent theoretical developments in the digital humanities. Based on this discussion, I conclude the article with suggestions for possible directions in future writing research.
This paper studies the use of emotion and reason in political discourse. Adopting computational-linguistics techniques to construct a validated text-based scale, we measure emotionality in 6 million speeches given in the U.S. Congress over the years 1858-2014. Intuitively, emotionality spikes during times of war and is highest in speeches about patriotism. In the time series, emotionality was relatively low and stable in earlier years but increased significantly starting in the late 1970s. Across members of Congress, emotionality is higher for Democrats, for women, for ethnic/religious minorities, for the opposition party, and for members with ideologically extreme roll-call voting records.
Users of forums, social networks and news portals now have the opportunity to publicly express their opinions on current political events, social issues, or their everyday lives. The analysis of opinion expression, which was originally a research topic primarily in the field of language learning, has now become an important research challenge in computational linguistics, which provides relevant solutions for various companies and organizations. The aim of this article is to analyse the messages with which users of the social network Twitter reacted to an incident in which Emmanuel Macron was slapped in the face by a man as he went out to meet the public. We analysed tweets that express agreement, disagreement and a neutral attitude towards the action. The analysis includes 80 tweets and covers the textual, syntactic and lexical levels. The results show that tweets expressing disagreement typically have a declarative or exclamatory form and a simple sentence structure, and include explicit vocabulary expressing the author’s opinion (shameful, disrespectful). Tweets expressing agreement are more likely to have an exclamatory form and a simple sentence structure, and to include an explicit term (well done, deserve a slap). Opinion-neutral tweets, on the other hand, are more likely to be formulated as declarative sentences with a complex sentence structure and do not include an explicit term expressing the author’s opinion. The presented method is based on basic grammatical criteria (number of sentences, sentence structure, sentence form, keywords), which can also be applied to the computational analysis of large collections of texts. In the future, the presented model could be applied to investigate various political, societal or healthcare challenges (elections, corruption or pandemic issues).
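Some of the grammatical criteria listed above lend themselves to simple automation. The following is a rough sketch under stated assumptions: sentences are split on terminal punctuation, sentence form is inferred from the final punctuation mark alone, and the keyword lists contain only the English terms quoted in the abstract. The function names are hypothetical, the example tweet is invented for illustration, and sentence structure (simple versus complex) is left out because it would need a syntactic parser.

```python
import re

# Terms quoted in the abstract; a real study would need a fuller lexicon.
DISAGREE_TERMS = ("shameful", "disrespectful")
AGREE_TERMS = ("well done", "deserve a slap")


def split_sentences(text):
    """Split text into sentences, keeping terminal punctuation with each sentence."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]


def sentence_form(sentence):
    """Classify a sentence by its final punctuation mark (a crude heuristic)."""
    if sentence.endswith("!"):
        return "exclamatory"
    if sentence.endswith("?"):
        return "interrogative"
    return "declarative"


def tweet_features(text):
    """Extract sentence count, sentence forms and explicit opinion terms from a tweet."""
    sentences = split_sentences(text)
    lowered = text.lower()
    return {
        "n_sentences": len(sentences),
        "forms": [sentence_form(s) for s in sentences],
        "disagree_term": any(term in lowered for term in DISAGREE_TERMS),
        "agree_term": any(term in lowered for term in AGREE_TERMS),
    }


# Invented example text, not one of the 80 analysed tweets.
features = tweet_features("Shameful! This is not how citizens should act.")
```

Features of this kind could feed a rule-based or statistical classifier of agreement, disagreement and neutrality, though the punctuation heuristic will misjudge tweets that omit or misuse end marks.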