Lemmatisation for under-resourced languages with sequence-to-sequence learning: A case of Early Irish

10.29007/cxtl ◽  
2019 ◽  
Author(s):  
Oksana Dereza

Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item, identified by the word’s lemma, or dictionary form. It is not a very complicated task for languages such as English, where a paradigm consists of a few forms close in spelling; but for morphologically rich languages, such as Russian, Hungarian or Irish, lemmatisation becomes more challenging. Nevertheless, this task is often considered solved for most resource-rich modern languages regardless of their morphological type. The situation is dramatically different for ancient languages, which are characterised not only by a rich inflectional system, but also by a high level of orthographic variation and, more importantly, a very small amount of available data. These factors make automatic morphological analysis of historical language data an underrepresented field in comparison to other NLP tasks. This work describes the creation of an Early Irish lemmatiser with a character-level sequence-to-sequence learning method that proves effective in overcoming data scarcity. A simple character-level sequence-to-sequence model trained for 34,000 iterations reached an accuracy of 99.2 % for known words and 64.9 % for unknown words on a rather small corpus of 83,155 samples. It outperforms both the baseline and the rule-based model described in [21] and [76] and matches the results of other systems working with historical data.
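As a rough illustration of the set-up this abstract describes, the sketch below shows how (word form, lemma) pairs might be turned into character-index sequences for a sequence-to-sequence model, and how accuracy can be reported separately for known and unknown words. It is a minimal sketch under stated assumptions, not the author's implementation: the helper names, the special symbols, and the dictionary-lookup predictor are hypothetical, and the handful of Old Irish forms (fer "man", ben "woman") are illustrative and not drawn from the paper's corpus.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (inflected form, lemma)

# Hypothetical special symbols for padding and sequence boundaries.
PAD, SOS, EOS = "<pad>", "<s>", "</s>"


def build_vocab(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Map every character seen in forms or lemmas to an integer index."""
    chars = sorted({ch for form, lemma in pairs for ch in form + lemma})
    return {sym: i for i, sym in enumerate([PAD, SOS, EOS] + chars)}


def encode(word: str, vocab: Dict[str, int]) -> List[int]:
    """Encode a word as <s> c1 c2 ... cn </s> using character indices."""
    return [vocab[SOS]] + [vocab[ch] for ch in word] + [vocab[EOS]]


def split_accuracy(predict: Callable[[str], str],
                   test_pairs: Iterable[Pair],
                   known_forms: Set[str]) -> Tuple[float, float]:
    """Return (accuracy on known forms, accuracy on unknown forms)."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for form, lemma in test_pairs:
        known = form in known_forms
        totals[known] += 1
        hits[known] += predict(form) == lemma
    known_acc, unknown_acc = (
        hits[k] / totals[k] if totals[k] else float("nan") for k in (True, False)
    )
    return known_acc, unknown_acc


# Illustrative training pairs (assumption: not taken from the paper's data).
train_pairs = [("fir", "fer"), ("fer", "fer"), ("mná", "ben")]
vocab = build_vocab(train_pairs)
encoded_form = encode("fir", vocab)

# Hypothetical stand-in for a trained model: look the form up, else copy it.
lookup = dict(train_pairs)
predict = lambda form: lookup.get(form, form)

test_pairs = [("fir", "fer"), ("mná", "ben"), ("fiur", "fer"), ("ben", "ben")]
known_acc, unknown_acc = split_accuracy(predict, test_pairs, set(lookup))
print(known_acc, unknown_acc)
```

The lookup predictor handles forms it has seen but can only copy unseen ones, which is why the known and unknown scores are reported separately. A character-level model is meant to narrow that gap by generalising over spelling patterns.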

2003 ◽  
Vol 25 (1) ◽  
pp. 87 ◽  
Author(s):  
K Belov ◽  
L Hellman

A full-length cDNA clone encoding the platypus (Ornithorhynchus anatinus) immunoglobulin M (IgM) heavy chain was isolated from a spleen cDNA library using a short-beaked echidna (Tachyglossus aculeatus) IgM constant region (Cµ) probe. The isolation of platypus IgM shows that O. anatinus, like all other examined jawed vertebrates, expresses a classical IgM molecule. Amino acid sequence comparisons of the constant regions of IgM reveal a high level of sequence conservation between O. anatinus and T. aculeatus sequences (87%), and only approximately 48% identity between O. anatinus and therian Cµ sequences. The variable region of this clone belongs to clan 3, supporting the view that this family is used preferentially, if not exclusively, by O. anatinus, as opposed to the use of all three variable region clans by T. aculeatus. Phylogenetic analysis of Cµ sequences supports the traditional Theria hypothesis and suggests that the O. anatinus and T. aculeatus lineages separated from their last common ancestor approximately 21 million years ago.


2010 ◽  
Vol 01 (04) ◽  
pp. 377-393 ◽  
Author(s):  
H.J. Kam ◽  
Y.M. Shin ◽  
S.M. Cho ◽  
S.Y. Kim ◽  
K.W. Kim ◽  
...  

Summary Objective: Questionnaire-based ADHD screening tests may not always be objective or accurate, owing to both subjectivity and prejudice. Despite attempts to develop objective measures to characterize ADHD, no widely applicable index currently exists. The principal aim of this study was to develop a decision support model for ADHD screening by monitoring children’s school activities using a 3-axial actigraph. Methods: Actigraphs were placed on the non-dominant wrists of 153 children for 3 hours while they were at school. Children who scored high on the questionnaires were clinically examined by child psychiatrists, who then confirmed ADHD. Mean, variance, and ratios of low-level (0.5-1.0G) and high-level (1.6-3.2G) activity were extracted as activity features from 142 children (10 ADHD, 132 non-ADHD). Two decision-tree models were constructed using the C5.0 algorithm: [A] from whole hours (class + playtime) and [B] during classes. Accuracy, sensitivity, and specificity were evaluated. PPV, NPV, likelihood ratio, and AUC were also calculated for evaluation. Results: [Model A] One child without ADHD was misclassified, resulting in an accuracy score of 99.30%. Sensitivity and NPV were 1.0000. Specificity and PPV were 0.992 and 0.803-0.909, respectively. [Model B] Two children without ADHD were misclassified, resulting in an accuracy score of 98.59%. Specificity and PPV were scored at 0.985 and 0.671-0.832, respectively. Conclusion: The selected features were consistent with the findings of previous studies. Objective screening of latent patients with ADHD can be accomplished with a simple watch-like sensor, which is worn for just a few hours while the child attends school. The model proposed herein can be applied to a great many children without heavy cost in time and manpower, and would generate valuable results from a public health perspective.
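The reported accuracy and specificity figures can be reproduced from the counts given in the abstract. The sketch below assumes the confusion matrices implied by the text: 10 ADHD and 132 non-ADHD children, with the stated misclassified non-ADHD children as the only errors (one for Model A, two for Model B). The `Confusion` class and its field names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # children with ADHD flagged as ADHD
    fn: int  # children with ADHD that the model missed
    fp: int  # children without ADHD flagged as ADHD
    tn: int  # children without ADHD correctly passed

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


# Assumed counts: 142 children in total, only false positives as errors.
model_a = Confusion(tp=10, fn=0, fp=1, tn=131)
model_b = Confusion(tp=10, fn=0, fp=2, tn=130)

for name, m in (("A", model_a), ("B", model_b)):
    print(f"Model {name}: accuracy={m.accuracy():.2%}, "
          f"sensitivity={m.sensitivity():.4f}, "
          f"specificity={m.specificity():.3f}")
```

Under these assumptions, Model A gives 141/142 correct and Model B gives 140/142, which match the 99.30% and 98.59% accuracy scores. The specificities 131/132 and 130/132 round to 0.992 and 0.985.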


2014 ◽  
Vol 7 (1) ◽  
pp. 98-137 ◽  
Author(s):  
THORA TENBRINK

Abstract: This paper offers the first general introduction to CODA (Cognitive Discourse Analysis), a methodology for analyzing verbal protocols and other types of unconstrained language use, as a resource for researchers interested in mental representations and high-level cognitive processes. CODA can be used to investigate verbalizations of perceived scenes and events, spatio-temporal concepts, complex cognitive processes such as problem-solving and cognitive strategies and heuristics, and other concepts that are accessible for verbalization. CODA builds on and extends relevant established methodologies such as cognitive linguistic perspectives, verbal protocol analysis in cognitive psychology and interdisciplinary content analysis, linguistic discourse analysis, and psycholinguistic experimentation.


2020 ◽  
Author(s):  
Norito Kawakami ◽  
Natsu Sasaki ◽  
Reiko Kuroda ◽  
Kanami Tsuno ◽  
Kotaro Imamura

BACKGROUND The use of a COVID-19 contact tracing app may be effective in reducing users' anxiety about COVID-19 and their psychological distress. OBJECTIVE This 2.5-month prospective study aimed to investigate the association of the use of a COVID-19 contact tracing app, the COVID-19 Contact Confirming Application (COCOA), released by the Japanese government, with fear and worry about COVID-19 and psychological distress in a sample of the general working population of Japan. METHODS A total of 996 full-time employed respondents to an online survey on May 22-26, 2020 (baseline) were invited to participate in a follow-up survey on August 7-12, 2020 (follow-up). A high level of worry about COVID-19 and high psychological distress were defined by scores on a single-item scale and the K6 scale, respectively, both at baseline and follow-up. The app was released between the two surveys, on June 17. Participants were asked at follow-up whether they had downloaded the app. RESULTS A total of 902 (90.6%) out of 996 baseline participants responded to the follow-up survey. Among them, 184 (20.4%) reported that they had downloaded the app. The use of the contact tracing app was significantly negatively associated with psychological distress, but not with fear and worry about COVID-19, at follow-up after controlling for baseline variables. CONCLUSIONS The study provided the first evidence that a COVID-19 contact tracing app is beneficial for the mental health of people during the COVID-19 outbreak. CLINICALTRIAL N/A


2021 ◽  
Vol 233 ◽  
pp. 107519
Author(s):  
Haoran Zhao ◽  
Xin Sun ◽  
Junyu Dong ◽  
Zihe Dong ◽  
Qiong Li

2021 ◽  
Author(s):  
Thien Pham ◽  
Loi Truong ◽  
Mao Nguyen ◽  
Akhil Garg ◽  
Liang Gao ◽  
...  

State-of-Health (SOH) prediction for a lithium-ion battery is essential for preventing malfunction and keeping the battery working efficiently. In practice, this task is difficult due to the high level of noise and complexity. Many machine learning methods, especially deep learning approaches, have recently been proposed to address this problem. However, there is much room for improvement because battery data is highly non-linear and depends strongly on multidisciplinary parameters such as resistance, voltage and the external conditions the battery is subjected to. In this paper, we propose an approach known as bidirectional sequence-in-sequence, which exploits the dependency of nested cycle-wise and channel-wise battery data. In experiments on a real dataset acquired from NASA, our method reduces error by up to approximately 32.5%.


2017 ◽  
Vol 13 (2) ◽  
pp. 616-624 ◽  
Author(s):  
Haijun Zhang ◽  
Jingxuan Li ◽  
Yuzhu Ji ◽  
Heng Yue

Author(s):  
Bhagyashri Wagh ◽  
J. V. Shinde ◽  
P. A. Kale

In today’s world, social networking websites such as Twitter, Facebook and Tumblr play a very significant role. Twitter is a micro-blogging platform that provides a tremendous amount of data, which can be used for various applications of sentiment analysis such as predictions, reviews, elections and marketing. Sentiment analysis is the process of extracting information from a large amount of data and classifying it into different classes called sentiments. Python is a simple yet powerful, high-level, interpreted and dynamic programming language, which is well known for processing natural language data using NLTK (Natural Language Toolkit). NLTK is a Python library that provides a base for building programs and classifying data. NLTK also provides graphical demonstrations for representing various results or trends, as well as sample data for training and testing various classifiers. Sentiment classification aims to automatically predict the sentiment polarity of users publishing sentiment data. Although traditional classification algorithms can be used to train sentiment classifiers from manually labelled text data, the labelling work can be time-consuming and expensive. Meanwhile, users often use different words when they express sentiment in different domains. If we directly apply a classifier trained in one domain to other domains, the performance will be very low due to the differences between these domains. In this work, we develop a general solution to sentiment classification for the case where we have no labels in the target domain but have some labelled data in a different domain, regarded as the source domain.
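The domain-mismatch problem described above can be seen even in a toy setting. The sketch below is an assumption-laden illustration, not the authors' method and not NLTK code: it learns word polarity scores from two invented source-domain (film review) sentences using only the Python standard library, then applies them to an invented target-domain (product review) sentence. All sentences, function names and the scoring rule are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_polarity(labelled: Iterable[Tuple[str, str]]) -> Counter:
    """Score each word: +1 per positive text containing it, -1 per negative."""
    scores: Counter = Counter()
    for text, label in labelled:
        for word in set(text.lower().split()):
            scores[word] += 1 if label == "pos" else -1
    return scores


def classify(text: str, scores: Counter) -> str:
    """Sum the learned word scores; words never seen in training count as 0."""
    total = sum(scores.get(word, 0) for word in text.lower().split())
    if total > 0:
        return "pos"
    if total < 0:
        return "neg"
    return "unknown"


# Invented source-domain training data (film reviews).
source = [
    ("gripping plot and brilliant acting", "pos"),
    ("dull plot and wooden acting", "neg"),
]
scores = train_polarity(source)

in_domain = classify("brilliant acting", scores)
out_of_domain = classify("battery drains fast and overheats", scores)
print(in_domain, out_of_domain)
```

The in-domain sentence is classified because it reuses words that carried sentiment in training. The target-domain sentence expresses its sentiment through words the source data never contained, so the classifier has nothing to go on.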


2010 ◽  
Vol 139 (11) ◽  
pp. 1661-1671 ◽  
Author(s):  
D. H. GROVE-WHITE ◽  
A. J. H. LEATHERBARROW ◽  
P. J. CRIPPS ◽  
P. J. DIGGLE ◽  
N. P. FRENCH

SUMMARY Multi-locus sequence typing was performed on 1003 Campylobacter jejuni isolates collected in a 2-year longitudinal study of 15 dairy farms and four sheep farms in Lancashire, UK. There was considerable farm-level variation in occurrence and prevalence of clonal complexes (CC). Clonal complexes ST61, ST21, ST403 and ST45 were most prevalent in cattle while in sheep CC ST42, ST21, ST48 and ST52 were most prevalent. CC ST45, a complex previously shown to be more common in summer months in human cases, was more prevalent in summer in our ruminant samples. Gene flow analysis demonstrated a high level of genetic heterogeneity at the within-farm level. Sequence-type diversity was greater in cattle compared to sheep, in cattle at pasture vs. housed, and in isolates from farms on the Pennines compared to the Southern Fylde. Sequence-type diversity was greatest in isolates belonging to CC ST21, ST45 and ST206.


Author(s):  
Kengatharaiyer Sarveswaran ◽  
Gihan Dias ◽  
Miriam Butt

Abstract: This paper presents an open source and extendable Morphological Analyser cum Generator (MAG) for Tamil named ThamizhiMorph. Tamil is a low-resource language in terms of NLP tools and applications. In addition, most of the available tools are neither open nor extendable. A morphological analyser is a key resource for the storage and retrieval of morphophonological and morphosyntactic information, especially for morphologically rich languages, and is also useful for developing applications within Machine Translation. This paper describes how ThamizhiMorph is designed using a Finite-State Transducer (FST) and implemented using Foma. We discuss our design decisions based on the peculiarities of Tamil and its nominal and verbal paradigms. We specify a high-level meta-language to efficiently characterise the language’s inflectional morphology. We evaluate ThamizhiMorph using text from a Tamil textbook and the Tamil Universal Dependency treebank version 2.5. The evaluation and error analysis attest a very high performance level, with the identified errors being mostly due to out-of-vocabulary items, which are easily fixable. In order to foster further development, we have made our scripts, the FST models, lexicons, Meta-Morphological rules, lists of generated verbs and nouns, and test data sets freely available for others to use and extend.

