training corpus Latest Research Papers

Abstract We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k) annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as a within-domain test set, and Kiev Folia is used as an out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach and Hidden Markov method (HMM), and added linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather non-impressive accuracy of 81% in in-domain settings, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.

Download Full-text

Using Shallow and Deep Learning to Automatically Detect Hate Motivated by Gender and Sexual Orientation on Twitter in Spanish

Multimodal Technologies and Interaction ◽

10.3390/mti5100063 ◽

2021 ◽

Vol 5 (10) ◽

pp. 63

Author(s):

Carlos Arcila-Calderón ◽

Javier J. Amores ◽

Patricia Sánchez-Holgado ◽

David Blanco-Herrero

Keyword(s):

Deep Learning ◽

Sexual Orientation ◽

Ad Hoc ◽

Hate Speech ◽

Learning Algorithm ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Training Corpus ◽

Deep Learning Algorithm ◽

And Gender

The increasing phenomenon of “cyberhate” is concerning because of the potential social implications of this form of verbal violence, which is aimed at already-stigmatized social groups. According to information collected by the Ministry of the Interior of Spain, the category of sexual orientation and gender identity is subject to the third-highest number of registered hate crimes, ranking behind racism/xenophobia and ideology. However, most of the existing computational approaches to online hate detection simultaneously attempt to address all types of discrimination, leading to weaker prototype performances. These approaches focus on other reasons for hate—primarily racism and xenophobia—and usually focus on English messages. Furthermore, few detection models have used manually generated databases as a training corpus. Using supervised machine learning techniques, the present research sought to overcome these limitations by developing and evaluating an automatic detector of hate speech motivated by gender and sexual orientation. The focus was Spanish-language posts on Twitter. For this purpose, eight predictive models were developed from an ad hoc generated training corpus, using shallow modeling and deep learning. The evaluation metrics showed that the deep learning algorithm performed significantly better than the shallow modeling algorithms, and logistic regression yielded the best performance of the shallow algorithms.

Download Full-text

Pre-Trained Transformer-Based Language Models for Sundanese

10.21203/rs.3.rs-907893/v1 ◽

2021 ◽

Author(s):

Wilson Wongso ◽

Henry Lucky ◽

Derwin Suhartono

Keyword(s):

Natural Language ◽

Text Classification ◽

Training Data ◽

Language Models ◽

Classification Task ◽

Language Understanding ◽

Training Corpus ◽

Low Resource ◽

Corpus Size ◽

Fine Tune

Abstract The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Download Full-text

Predicting the N400 ERP component using the Sentence Gestalt model trained on a large scale corpus

10.1101/2021.05.12.443787 ◽

2021 ◽

Author(s):

Alessandro Lopopolo ◽

Milena Rabovsky

Keyword(s):

Sentence Processing ◽

Large Scale ◽

Event Related Potential ◽

Probabilistic Representation ◽

Training Corpus ◽

Sentence Meaning ◽

N400 Component ◽

Linguistic Input ◽

Gestalt Model ◽

Large Corpus

The N400 component of the event-related brain potential is widely used to investigate language and meaning processing. However, despite much research, the component's functional basis remains actively debated. Recent work showed that the update of the predictive representation of sentence meaning (semantic update, or SU) generated by the Sentence Gestalt model (Mcclelland1 et al. 1989) consistently displayed a similar pattern to the N400 amplitude in a series of conditions known to modulate this event-related potential. These results led Rabovsky et al. (2018) to suggest that the N400 might reflect change in a probabilistic representation of meaning corresponding to an implicit semantic prediction error. However, a limitation of this work is that the model was trained on a small artificial training corpus and thus could not be presented with the same naturalistic stimuli presented in empirical experiments. In the present study, we overcome this limitation and directly model the amplitude of the N400 elicited during naturalistic sentence processing by using as predictor the SU generated by a Sentence Gestalt model trained on a large corpus of texts. The results reported in this paper corroborate the hypothesis that the N400 component reflects the change in a probabilistic representation of meaning after every word presentation. Further analyses demonstrate that the SU of the Sentence Gestalt model and the amplitude of the N400 are influenced similarly by the stochastic and positional properties of the linguistic input.

Download Full-text

You can simply rely on communities for a robust characterization of stances

The International FLAIRS Conference Proceedings ◽

10.32473/flairs.v34i1.128515 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Damián Ariel Furman ◽

Santiago Marro ◽

Cristian Cardellino ◽

Diana Nicoleta Popa ◽

Laura Alonso Alemany

Keyword(s):

Error Propagation ◽

Class Imbalance ◽

Detection Task ◽

Crowd Sourcing ◽

Training Corpus ◽

Structure Of Communities ◽

Supervised Classifiers ◽

Weakly Supervised

We show that the structure of communities in social me- dia provides robust information for weakly supervised approaches to assign stances to tweets. Using as seed the SemEval 2016 Stance Detection Task annotated data, we retrieved a high number of topically related tweets. We then propagated information from the manually an- notated seed to the retrieved tweets and thus obtained a bigger training corpus. Classifiers trained with this bigger, weakly supervised dataset reach similar or better performance than those trained with the manually annotated seed. In addition, they are more robust with respect to common manual annotator errors or biases and they have arguably more coverage than smaller datasets. Weakly supervised approaches, most notably self- supervision, commonly suffer from error propagation. Interestingly, communities seem to provide a structure that constrains error propagation. In particular, weakly supervised classifiers that incorporate community struc- ture are more robust with respect to class imbalance. Additionally, this is a straightforward, transparent ap- proach, using standard tools and pipelines, cheaper and faster than methods like crowd sourcing for manual an- notations. Thus it facilitates adoption, interpretability and accountability.

Download Full-text

A Sentence Classification Framework to Identify Geometric Errors in Radiation Therapy from Relevant Literature

Information ◽

10.3390/info12040139 ◽

2021 ◽

Vol 12 (4) ◽

pp. 139

Author(s):

Tanmay Basu ◽

Simon Goldsworthy ◽

Georgios V. Gkoutos

Keyword(s):

Language Processing ◽

Systematic Reviews ◽

Data Extraction ◽

Research Question ◽

Relevant Literature ◽

Support Vector ◽

Svm Classifier ◽

Geometric Errors ◽

Training Corpus ◽

Different Types

The objective of systematic reviews is to address a research question by summarizing relevant studies following a detailed, comprehensive, and transparent plan and search protocol to reduce bias. Systematic reviews are very useful in the biomedical and healthcare domain; however, the data extraction phase of the systematic review process necessitates substantive expertise and is labour-intensive and time-consuming. The aim of this work is to partially automate the process of building systematic radiotherapy treatment literature reviews by summarizing the required data elements of geometric errors of radiotherapy from relevant literature using machine learning and natural language processing (NLP) approaches. A framework is developed in this study that initially builds a training corpus by extracting sentences containing different types of geometric errors of radiotherapy from relevant publications. The publications are retrieved from PubMed following a given set of rules defined by a domain expert. Subsequently, the method develops a training corpus by extracting relevant sentences using a sentence similarity measure. A support vector machine (SVM) classifier is then trained on this training corpus to extract the sentences from new publications which contain relevant geometric errors. To demonstrate the proposed approach, we have used 60 publications containing geometric errors in radiotherapy to automatically extract the sentences stating the mean and standard deviation of different types of errors between planned and executed radiotherapy. The experimental results show that the recall and precision of the proposed framework are, respectively, 97% and 72%. The results clearly show that the framework is able to extract almost all sentences containing required data of geometric errors.

Download Full-text

Automatic Diacritic Recovery with focus on the Quality of the training Corpus for Resource-scarce Languages

2020 IEEE 2nd International Conference on Cyberspac (CYBER NIGERIA) ◽

10.1109/cybernigeria51635.2021.9428872 ◽

2021 ◽

Author(s):

Ikechukwu Ignatius Ayogu ◽

Onoja Abu

Keyword(s):

Training Corpus

Download Full-text

A Novel Method for Reducing Overhead of Training Sentiment Analysis Network

10.21203/rs.3.rs-181676/v1 ◽

2021 ◽

Author(s):

Yuxia Lei ◽

Linkun Zhang ◽

Zhengyan Wang ◽

Zhiqiang Kong ◽

Hanru Ma ◽

...

Keyword(s):

Sentiment Analysis ◽

Large Scale ◽

High Accuracy ◽

Formal Concept ◽

Important Research ◽

Research Focus ◽

Training Corpus ◽

Scale Method ◽

Computational Overhead ◽

Novel Method

Abstract Sentiment analysis based on statistics has rapidly developed in deep-learning. Bilateral attention neural network (BANN), especially Bidirectional Encoder Representations from Transformers (BERT), has reached high accuracy. However, with the increase of network depth and large-scale corpus, the computational overhead of BANN increases geometrically. How to reduce training corpus scale has correspondingly become an important research focus. This paper proposes a reduced corpus scale method called Concept-BERT, which consists of the following steps: firstly, using Formal Concept Analysis (FCA), Concept-BERT mines the association rules among corpus and reduces corpus attributes, and hence reducing corpus scale; secondly, reduced-corpus is inputed to BERT and the result is obtained; finally, the attention of Concept-BERT is analyzed. Concept-BERT is experimented for sentiment analysis on CoLA, SST-2, Dianping and Blogsenti, and its accuracy reaches 81.1, 92.9, 77.9 and 86.7 respectively. Our experimental results show that the proposed method has the same accuracy as BERT, using low-scale corpus and low overhead, and low-scale corpus doesn't affect model attention.

Download Full-text

How do blind people know that blue is cold? Distributional semantics encode color-adjective associations.

10.31234/osf.io/vyxpq ◽

2021 ◽

Author(s):

Jeroen van Paridon ◽

Qiawen Liu ◽

Gary Lupyan

Keyword(s):

Projection Method ◽

Effect Size ◽

Visual Experience ◽

Semantic Information ◽

Blind People ◽

Distributional Semantics ◽

Word Embeddings ◽

Training Corpus ◽

Written Text ◽

Congenitally Blind

Certain colors are strongly associated with certain adjectives (e.g. red is hot, blue is cold). Some of these associations are grounded in visual experiences like seeing hot embers glow red. Surprisingly, many congenitally blind people show similar color associations, despite lacking all visual experience of color. Presumably, they learn these associations via language. Can we detect these associations in the statistics of language? And if so, what form do they take? We apply a projection method to word embeddings trained on corpora of spoken and written text to identify color-adjective associations as they are represented in language. We show that these projections are predictive of color-adjective ratings collected from blind and sighted people, and that the effect size depends on the training corpus. Finally, we examine how color-adjective associations might be represented in language by training word embeddings on corpora from which various sources of color-semantic information are removed.

Download Full-text

training corpus
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

GAN for Generating User-Specific Human Activity Data From An Incomplete Training Corpus

An HMM-Based PoS Tagger for Old Church Slavonic

Using Shallow and Deep Learning to Automatically Detect Hate Motivated by Gender and Sexual Orientation on Twitter in Spanish

Pre-Trained Transformer-Based Language Models for Sundanese

Predicting the N400 ERP component using the Sentence Gestalt model trained on a large scale corpus

You can simply rely on communities for a robust characterization of stances

A Sentence Classification Framework to Identify Geometric Errors in Radiation Therapy from Relevant Literature

Automatic Diacritic Recovery with focus on the Quality of the training Corpus for Resource-scarce Languages

A Novel Method for Reducing Overhead of Training Sentiment Analysis Network

How do blind people know that blue is cold? Distributional semantics encode color-adjective associations.

Export Citation Format

training corpusRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

GAN for Generating User-Specific Human Activity Data From An Incomplete Training Corpus

An HMM-Based PoS Tagger for Old Church Slavonic

Using Shallow and Deep Learning to Automatically Detect Hate Motivated by Gender and Sexual Orientation on Twitter in Spanish

Pre-Trained Transformer-Based Language Models for Sundanese

Predicting the N400 ERP component using the Sentence Gestalt model trained on a large scale corpus

You can simply rely on communities for a robust characterization of stances

A Sentence Classification Framework to Identify Geometric Errors in Radiation Therapy from Relevant Literature

Automatic Diacritic Recovery with focus on the Quality of the training Corpus for Resource-scarce Languages

A Novel Method for Reducing Overhead of Training Sentiment Analysis Network

How do blind people know that blue is cold? Distributional semantics encode color-adjective associations.

training corpus
Recently Published Documents