part of speech
Recently Published Documents





Sunita Warjri, Partha Pakray, Saralin A. Lyngdoh, Arnab Kumar Maji

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language and large amounts of data or corpora for feature engineering, which together lead to good tagger performance. Our main contribution in this work is a newly designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed corpus, each word is tagged manually using the designed tagset. Deep learning methods were used to experiment with the corpus: we present POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM. The main challenges anticipated in understanding and handling natural language for computational linguistics are discussed. In designing the corpus, we tried to resolve the ambiguities of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. In initial experiments on the first sets of data, around 41,000 tokens, the taggers already yielded considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy rose and the analyses became more pertinent. The resulting accuracy is 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for character-based embedding with LSTM. For comparison, we also present some recent POS taggers and other NLP work on the Khasi language.
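
The corpus figures quoted above (token count, distinct word count, per-token accuracy) can be computed with a few lines of Python. The sketch below is only an illustration under stated assumptions: the toy corpus, the placeholder words `w1` to `w4`, and the tag names are hypothetical and are not taken from the Khasi corpus or its tagset.

```python
# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
# Words and tags are placeholders, not Khasi data.
corpus = [
    [("w1", "NOUN"), ("w2", "VERB"), ("w3", "NOUN")],
    [("w1", "NOUN"), ("w4", "ADJ")],
]


def corpus_stats(sentences):
    """Return (number of tokens, number of distinct words)."""
    words = [word for sentence in sentences for word, _ in sentence]
    return len(words), len(set(words))


def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


tokens, types = corpus_stats(corpus)
gold = [tag for sentence in corpus for _, tag in sentence]
pred = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN"]  # last tag is wrong on purpose
accuracy = tagging_accuracy(gold, pred)
print(tokens, types, f"{accuracy:.2%}")  # prints: 5 4 80.00%
```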

Dana Halabi, Ebaa Fayyoumi, Arafat Awajan

Treebanks are valuable linguistic resources that record the syntactic structure of a language's sentences in addition to part-of-speech tags and morphological features. They are mainly used to train statistical parsers. Although statistical natural language parsers have recently become more accurate for languages such as English, those for Arabic still have low accuracy. The purpose of this article is to construct a new Arabic dependency treebank based on traditional Arabic grammatical theory and the characteristics of the Arabic language, and to investigate the effect of these choices on the accuracy of statistical parsers. The proposed Arabic dependency treebank, called I3rab, differs from existing Arabic dependency treebanks in two main respects. The first is the approach to determining the main word of the sentence, and the second is the representation of joined and covert pronouns. To evaluate I3rab, we compared its performance against a subset of the Prague Arabic Dependency Treebank that shares a comparable level of detail. The experiments show an improvement of up to 10.24% in unlabeled attachment score (UAS) and 18.42% in labeled attachment score (LAS).
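
For readers unfamiliar with the two metrics, UAS counts tokens whose predicted head is correct, and LAS counts tokens whose head and dependency label are both correct. A minimal sketch follows; the head indices and relation labels are hypothetical examples and do not reflect the I3rab label set or the authors' evaluation script.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) over a flat list of tokens.

    Each token is a (head_index, relation_label) pair.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("token lists must be non-empty and equally long")
    total = len(gold)
    # UAS: only the head index has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head index and relation label both have to match.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / total, full_hits / total


# Hypothetical four-token sentence; labels are illustrative only.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # prints: 0.75 0.5
```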

2022, Vol 8 (1), pp. 338-344
D. Akmatova

The origin of imitative words has ancient roots, and people have been interested in imitations since ancient times. Not only linguists but also philosophers and psychologists have, to one degree or another and at different times, addressed the problem of sound visualization. Imitative vocabulary helps to increase the imagery and emotional expressiveness of the word. However, because of its complex nature, linguists working on various languages for some time did not conduct serious research on the linguistic functions of imitative words. Yet such words are often found in oral folk art and fiction, where they give texts artistic and poetic meaning, expressiveness, imagery, artistic power and accessibility, liveliness and dynamism. All this has led linguists to pay close attention to the study of this unusual group of words. They are divided into types and separated into a special part of speech; they are used in the formation of new words; and they act as members of a sentence.

2022, Vol 12 (1)
Marie-Thérèse Le Normand, Hung Thai-Van

The question of how children learn Function Words (FWs) is still a matter of debate among child language researchers. Are early multiword utterances based on lexically specific patterns or rather on abstract grammatical relations? In this corpus study, we analyzed FWs having a highly predictable distribution in relation to Mean Length of Utterance (MLU), an index of syntactic complexity, in a large naturalistic sample of 315 monolingual French children aged 2 to 4 years. The data were annotated with a Part Of Speech Tagger (POS-T) from the CHILDES computational tools. While eighteen FWs correlated strongly with MLU, expressed either in words or in morphemes, stepwise regression analyses showed that subject pronouns predicted MLU. Factor analysis yielded a bifactor hierarchical model: the first factor loaded sixteen FWs, of which eight had a strong developmental weight (third person singular verbs, subject pronouns, articles, auxiliary verbs, prepositions, modals, demonstrative pronouns and plural markers), whereas the second factor loaded complex FWs (possessive verbs and object pronouns). These findings challenge the lexicalist account and support the view that children learn grammatical forms as a complex system based on early rather than late structure building. Children may acquire FWs as combining words and build syntactic knowledge as a complex abstract system that is not innate but learned from the context of multiword input sentences. Notably, FWs were found to predict syntactic development and sentence complexity. These results open up new perspectives for clinical assessment and intervention.
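
MLU in words is simply the total number of words divided by the number of utterances. The sketch below illustrates that arithmetic only. It assumes whitespace tokenization, which is a simplification; it does not reproduce the CHILDES tools, and MLU in morphemes would additionally require morphological segmentation. The sample utterances are invented.

```python
def mean_length_of_utterance(utterances):
    """MLU in words: total word count divided by the number of utterances."""
    # Whitespace splitting is an assumption made for this illustration.
    lengths = [len(u.split()) for u in utterances if u.strip()]
    if not lengths:
        raise ValueError("at least one non-empty utterance is required")
    return sum(lengths) / len(lengths)


# Invented child utterances: 3 + 2 + 1 words over 3 utterances.
sample = ["le chat dort", "il mange", "encore"]
mlu = mean_length_of_utterance(sample)
print(mlu)  # prints: 2.0
```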

2022, Vol 2022, pp. 1-14
Chun-Xiang Zhang, Shu-Yang Pang, Xue-Yao Gao, Jia-Qi Lu, Bo Yu

To improve the disambiguation accuracy of biomedical words, this paper proposes a word sense disambiguation (WSD) method based on an attention neural network. The biomedical word is taken as the center, and morphology, part of speech, and semantic information from 4 adjacent lexical units are extracted as disambiguation features. An attention layer generates a feature matrix, and average asymmetric convolutional neural networks (Av-ACNN) and bidirectional long short-term memory (Bi-LSTM) networks extract features from it. The softmax function then determines the semantic category of the biomedical word. For comparison, CNN, LSTM, and Bi-LSTM are also applied to biomedical WSD. The MSH corpus is used to optimize CNN, LSTM, Bi-LSTM, and the proposed method and to test their disambiguation performance. Experimental results show that the average disambiguation accuracy of the proposed method is higher than that of CNN, LSTM, and Bi-LSTM, reaching 91.38%.
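
The final classification step, turning per-category scores into a probability distribution with softmax and picking the most likely semantic category, can be sketched as follows. This shows only the softmax step, not the attention, Av-ACNN, or Bi-LSTM layers; the sense names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical sense inventory and network output scores for one word.
senses = ["sense_A", "sense_B", "sense_C"]
scores = [2.0, 0.5, -1.0]

probs = softmax(scores)
predicted = senses[probs.index(max(probs))]
print(predicted)  # prints: sense_A
```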

2021, pp. 587-595
Alebachew Chiche, Hiwot Kadi, Tibebu Bekele

Natural language processing (NLP) plays a major role in providing an interface for human-computer communication. It enables people to talk with the computer in their natural language rather than machine language. This study presents a part-of-speech (POS) tagger that assigns a word class to each word in a given sentence or paragraph. Researchers have developed POS taggers for languages such as English, Amharic, Afan Oromo, and Tigrigna. Many other languages, such as Shekki’noono, still lack a POS tagger. A POS tagger is a basic component of most NLP tools, such as machine translation and information extraction. Developing a POS tagger for a language is therefore a prerequisite for building advanced NLP applications in it, and those applications enhance machine-to-machine, machine-to-human, and human-to-human communication. However, a POS tagger built for one language cannot be applied directly to another. To develop the Shekki’noono POS tagger, we used the stochastic Hidden Markov Model (HMM). The study uses 1,500 sentences collected from sources such as newspapers (covering social, economic, and political topics), modules, textbooks, radio programs, and bulletins. Language experts labeled each word in the collected sentences with its appropriate part of speech. The tagger was trained on the training sets using the HMM. The experiments show that HMM-based POS tagging achieves 92.77% accuracy for Shekki’noono. The model is also compared with previous HMM-based experiments in related work. As future work, the proposed approach can be evaluated on a larger corpus.
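
An HMM tagger typically selects the most probable tag sequence with Viterbi decoding. The sketch below is a generic first-order HMM decoder, not the authors' implementation; the English toy words, the three tags, the probability tables, and the `floor` smoothing constant are all assumptions made for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a first-order HMM.

    Probabilities missing from a table fall back to `floor` (an assumed
    smoothing constant) so that unseen events do not produce log(0).
    """
    if not words:
        return []

    def log_p(table, key):
        return math.log(table.get(key, floor))

    # best[i][t]: log-probability of the best path ending in tag t at word i
    best = [{t: log_p(start_p, t) + log_p(emit_p[t], words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_p(trans_p[p], t))
            best[i][t] = (best[i - 1][prev] + log_p(trans_p[prev], t)
                          + log_p(emit_p[t], words[i]))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with made-up probabilities (English placeholders).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6},
}

tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(tagged)  # prints: ['DET', 'NOUN', 'VERB']
```

In a trained tagger the three tables would be estimated from the labeled sentences, for example by relative-frequency counts.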

2021, Vol 5 (4), pp. 430
Jaehong Kim, Hosung Woo, Jamee Kim, WonGyu Lee

With the development of information and communication technology, countries around the world have strengthened their computer science curricula. Korea also revised its informatics curriculum in 2015 with a focus on computer science (informatics is the name of the school subject related to computer science in Korea). The purpose of this study was to automatically extract terms from textbooks and analyze whether the textbooks reflected the learning elements of the informatics curriculum in South Korea. The terms for the learning elements are mainly compound words, and the Korean language undergoes various transformations that make natural language processing difficult. Taking both into account, this study pre-processed the textbook texts and the learning elements and derived the reflection status and frequency of each learning element. The terms used in the textbooks were extracted automatically by using the textbooks' indexes and the part-of-speech compositions of the index entries. Moreover, this study analyzed the relevance between terms by deriving the confidence of other terms for each learning element used in the textbooks. The analysis revealed that the textbooks did not reflect some learning elements in the forms presented in the curriculum, suggesting that textbooks need to explain the concepts of the learning elements using the curriculum's forms at least once. This study is meaningful in that terms were automatically extracted and analyzed in Korean textbooks based on the words suggested by the curriculum. The method can also be applied to textbooks of other subjects.
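
Assuming "confidence" is meant in the association-rule sense, the confidence of term B given learning element A is the share of text units containing A that also contain B. The sketch below illustrates that formula only; the abstract does not specify the unit of analysis or the exact computation, so the per-unit term sets and the English terms here are hypothetical.

```python
def confidence(units, antecedent, consequent):
    """confidence(A -> B) = units containing A and B / units containing A."""
    with_antecedent = [u for u in units if antecedent in u]
    if not with_antecedent:
        return 0.0  # A never occurs, so no rule can be supported
    with_both = sum(consequent in u for u in with_antecedent)
    return with_both / len(with_antecedent)


# Hypothetical term sets, one per textbook unit (e.g. a section).
units = [
    {"algorithm", "sorting", "search"},
    {"algorithm", "sorting"},
    {"algorithm", "variable"},
    {"network", "protocol"},
]

conf = confidence(units, "algorithm", "sorting")
print(round(conf, 3))  # prints: 0.667
```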
