Towards Accurate and Efficient Chinese Part-of-Speech Tagging

From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by syntactic parsing in the constituency formalism, and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated, hybrid approaches yield a relative error reduction of 18% in total over state-of-the-art baselines. Despite their effectiveness in boosting accuracy, hybrid systems rely on computationally expensive parsers, which makes them inappropriate for many realistic NLP applications. In this article, we are therefore also concerned with improving tagging efficiency at test time. In particular, we explore unlabeled data to transfer the predictive power of hybrid models to simple sequence models. Specifically, hybrid systems are utilized to create large-scale pseudo training data for cheap models. Experimental results illustrate that the re-compiled models not only achieve high accuracy with respect to per-token classification, but also serve well as a front end to a parser.
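The pseudo-training-data idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `expensive_tagger` stands in for a hybrid system, and the "cheap model" is a simple most-frequent-tag lookup rather than the sequence models used in the article.

```python
from collections import Counter, defaultdict


def make_pseudo_data(expensive_tagger, unlabeled_sentences):
    """Label raw sentences with the expensive tagger to obtain (word, tag) pairs."""
    return [list(zip(sent, expensive_tagger(sent))) for sent in unlabeled_sentences]


def train_unigram_tagger(tagged_sentences, default_tag="NN"):
    """Train a cheap stand-in model: each word gets its most frequent pseudo tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def tagger(sentence):
        # Unknown words fall back to the (hypothetical) default tag.
        return [best.get(word, default_tag) for word in sentence]

    return tagger
```

The cheap tagger never calls the expensive system at test time; it only consumes the labels that system produced offline on unlabeled text.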

Exploring the Performance of Tagging for the Classical and the Modern Standard Arabic

Advances in Fuzzy Systems ◽

10.1155/2019/6254649 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10

Author(s):

Dia AbuZeina ◽

Taqieddin Mostafa Abdalbaset

Keyword(s):

Language Processing ◽

Speech Synthesis ◽

Training Data ◽

Modern Standard Arabic ◽

Standard Arabic ◽

Pos Tagging ◽

Holy Quran ◽

The Holy Quran ◽

Unknown Words ◽

Modern Standard

Part-of-speech (PoS) tagging is a core component in many natural language processing (NLP) applications. PoS taggers serve as a preprocessing step in various NLP tasks, such as syntactic parsing, information extraction, machine translation, and speech synthesis. In this paper, we examine the performance of a modern standard Arabic (MSA) based tagger on classical (i.e., traditional or historical) Arabic. We employed the Stanford Arabic model tagger to evaluate the imperative verbs in the Holy Quran. The Stanford tagger contains 29 tags; however, this work experimentally evaluates just one, VB (imperative verb). The testing set contains 741 imperative verbs, which appear in 1,848 positions in the Holy Quran. Although the previously reported accuracy of the Arabic model of the Stanford tagger is 96.26% for all tags and 80.14% for unknown words, the experimental results show that its accuracy is only 7.28% for the imperative verbs. This result points to the need for further research to explain why the tagging is so inaccurate for classical Arabic. The performance decline might indicate that training data for classical Arabic and for MSA need to be distinguished in NLP tasks.

Tree Kernels for Semantic Role Labeling

Computational Linguistics ◽

10.1162/coli.2008.34.2.193 ◽

2008 ◽

Vol 34 (2) ◽

pp. 193-224 ◽

Cited By ~ 58

Author(s):

Alessandro Moschitti ◽

Daniele Pighin ◽

Roberto Basili

Keyword(s):

Language Processing ◽

Large Scale ◽

Kernel Functions ◽

Feature Representation ◽

Training Data ◽

Support Vector ◽

Feature Engineering ◽

Semantic Role ◽

Learning Approaches ◽

Semantic Role Labeling

The availability of large-scale data sets of manually annotated predicate-argument structures has recently favored the use of machine learning approaches to the design of automated semantic role labeling (SRL) systems. The main research in this area relates to the design choices for feature representation and for effective decompositions of the task in different learning models. Regarding the former choice, structural properties of full syntactic parses are largely employed as they represent ways to encode different principles suggested by the linking theory between syntax and semantics. The latter choice relates to several learning schemes over global views of the parses. For example, re-ranking stages operating over alternative predicate-argument sequences of the same sentence have been shown to be very effective. In this article, we propose several kernel functions to model parse tree properties in kernel-based machines, for example, perceptrons or support vector machines. In particular, we define different kinds of tree kernels as general approaches to feature engineering in SRL. Moreover, we extensively experiment with such kernels to investigate their contribution to individual stages of an SRL architecture both in isolation and in combination with other traditional manually coded features. The results for boundary recognition, classification, and re-ranking stages provide systematic evidence about the significant impact of tree kernels on the overall accuracy, especially when the amount of training data is small. As a conclusive result, tree kernels allow for a general and easily portable feature engineering method which is applicable to a large family of natural language processing tasks.
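To make the notion of a tree kernel concrete, here is a minimal sketch of the classic subset tree kernel of Collins and Duffy, which counts the tree fragments two parse trees share. This is a swapped-in, well-known kernel chosen for illustration; the abstract does not specify the kernels the article defines, and the nested-tuple tree encoding and the decay parameter `lam` are assumptions.

```python
def nodes(tree):
    """Yield every internal node of a tree encoded as (label, child, ...)."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)


def label(t):
    """Label of an internal node, or the word itself for a leaf string."""
    return t[0] if isinstance(t, tuple) else t


def production(node):
    """Grammar production at a node: (parent label, tuple of child labels)."""
    return (node[0], tuple(label(c) for c in node[1:]))


def delta(n1, n2, lam=1.0):
    """Weighted count of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        # Only internal children can be expanded further; leaves contribute nothing extra.
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + delta(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum delta over all pairs of internal nodes of the two trees."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))
```

A tree is written as, for example, `("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))`. The resulting kernel value can be passed to any kernel machine, such as a support vector machine with a precomputed kernel matrix.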

Improving shift-reduce constituency parsing with large-scale unlabeled data

Natural Language Engineering ◽

10.1017/s1351324913000119 ◽

2013 ◽

Vol 21 (1) ◽

pp. 113-138 ◽

Cited By ~ 1

Author(s):

MUHUA ZHU ◽

JINGBO ZHU ◽

HUIZHEN WANG

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

State Of The Art ◽

Unlabeled Data ◽

Experimental Results ◽

Empirical Methods ◽

Part Of Speech

Shift-reduce parsing has been studied extensively for diverse grammars due to its simplicity and running efficiency. However, in the field of constituency parsing, shift-reduce parsers lag behind state-of-the-art parsers. In this paper we propose a semi-supervised approach for advancing shift-reduce constituency parsing. First, we apply the uptraining approach (Petrov, S. et al. 2010. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cambridge, MA, USA, pp. 705–713) to improve part-of-speech taggers so that they provide better part-of-speech tags to subsequent shift-reduce parsers. Second, we enhance shift-reduce parsing models with novel features that are defined on lexical dependency information. Both stages depend on the use of large-scale unlabeled data. Experimental results show that the approach achieves overall improvements of 1.5 percent and 2.1 percent on English and Chinese data respectively. Moreover, the final parsing accuracies reach 90.9 percent and 82.2 percent respectively, which are comparable with the accuracy of state-of-the-art parsers.

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00408 ◽

2021 ◽

Vol 9 ◽

pp. 978-994

Author(s):

Emanuele Bugliarello ◽

Ryan Cotterell ◽

Naoaki Okazaki ◽

Desmond Elliott

Keyword(s):

Language Processing ◽

Large Scale ◽

Meta Analysis ◽

Training Data ◽

Fine Tuning ◽

Controlled Experiments ◽

Unified Framework ◽

Massive Models ◽

Vision And Language ◽

And Task

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

Decoding EEG Brain Activity for Multi-Modal Natural Language Processing

Frontiers in Human Neuroscience ◽

10.3389/fnhum.2021.659410 ◽

2021 ◽

Vol 15 ◽

Author(s):

Nora Hollenstein ◽

Cedric Renggli ◽

Benjamin Glaus ◽

Maria Barrett ◽

Marius Troendle ◽

...

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Brain Activity ◽

Training Data ◽

Human Cognition ◽

Special Focus ◽

Eeg Data

Until recently, human behavioral data from reading has mainly been of interest to researchers seeking to understand human cognition. However, these human language processing signals can also be beneficial in machine learning-based natural language processing tasks. Using EEG brain activity for this purpose is still largely unexplored. In this paper, we present the first large-scale study that systematically analyzes the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input as well as from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, only the contextualized BERT embeddings outperform the baselines in our experiments, which raises the need for further research. Finally, EEG data proves particularly promising when limited training data is available.

FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/622 ◽

2020 ◽

Cited By ~ 1

Author(s):

Zhuang Liu ◽

Degen Huang ◽

Kaiyu Huang ◽

Zhuang Li ◽

Jun Zhao

Keyword(s):

Deep Learning ◽

Text Mining ◽

Language Processing ◽

Large Scale ◽

Language Model ◽

Training Data ◽

Domain Specific ◽

Current State ◽

Language Representation ◽

Financial Domain

There is growing interest in the tasks of financial text mining. Over the past few years, Natural Language Processing (NLP) based on deep learning has advanced rapidly, and deep learning has shown promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. In FinBERT, unlike in BERT, we construct six pre-training tasks covering more knowledge, trained simultaneously on general corpora and financial domain corpora, which enables the FinBERT model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.

Large Scale Needs-Based Open Innovation via Automated Semantic Textual Similarity Analysis

Volume 7: 27th International Conference on Design Theory and Methodology ◽

10.1115/detc2015-47358 ◽

2015 ◽

Cited By ~ 1

Author(s):

Cory R. Schaffhausen ◽

Timothy M. Kowalewski

Keyword(s):

Language Processing ◽

Open Innovation ◽

Large Scale ◽

Semantic Analysis ◽

Sentence Length ◽

Training Data ◽

Potential Applications ◽

Crowd Size ◽

Automated Screening ◽

Semantic Textual Similarity

Open innovation often enjoys large quantities of submitted content. Yet the need to effectively process such large quantities of content impedes the widespread use of open innovation in practice. This article presents an exploration of needs-based open innovation using state-of-the-art natural language processing (NLP) algorithms to address existing limitations of exploiting large amounts of incoming data. The Semantic Textual Similarity (STS) algorithms were specifically developed to compare sentence-length text passages and were used to rate the semantic similarity of pairs of text sentences submitted by users of a custom open innovation platform. A total of 341 unique users submitted 1,735 textual problem statements or unmet needs relating to multiple topics: cooking, cleaning, and travel. Scores of equivalence generated by a consensus of ten human evaluators for a subset of the needs provided a benchmark for similarity comparison. The semantic analysis allowed for rapid (1 day per topic), automated screening of redundancy to facilitate identification of quality submissions. In addition, a series of permutation analyses provided critical crowd characteristics for the rates of redundant entries as crowd size increases. The results identify top modern STS algorithms for needfinding. These predicted similarity with Pearson correlations of up to .85 when trained using need-based training data and up to .83 when trained using generalized data. Rates of duplication varied with crowd size and may be approximately linear or appear asymptotic depending on the degree of similarity used as a cutoff. Semantic algorithm performance has shown rapid improvements in recent years. Potential applications to screen duplicates and also to screen highly unique sentences for rapid exploration of a space are discussed.
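The Pearson correlations reported above compare algorithm-predicted similarity scores with the human benchmark scores. As a minimal sketch of how such a figure is computed (the function name and input layout are assumptions, not the authors' evaluation code):

```python
import math


def pearson(predicted, human):
    """Pearson correlation between predicted and human similarity scores."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_p = sum(predicted) / len(predicted)
    mean_h = sum(human) / len(human)
    # Covariance and variances are left unnormalized; the factors cancel in the ratio.
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_h)
```

A value near 1 means the algorithm ranks sentence pairs almost exactly as the human evaluators did; a value near 0 means no linear relationship.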

Translating Videos into Synthetic Training Data for Wearable Sensor-Based Activity Recognition Systems Using Residual Deep Convolutional Networks

Applied Sciences ◽

10.3390/app11073094 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3094

Author(s):

Vitor Fortes Rey ◽

Kamalveer Kaur Garewal ◽

Paul Lukowicz

Keyword(s):

Computer Vision ◽

Regression Model ◽

Activity Recognition ◽

Language Processing ◽

Large Scale ◽

Simulated Data ◽

Training Data ◽

Sensor Data ◽

Activity Data ◽

Data Set

Human activity recognition (HAR) using wearable sensors has benefited much less from recent advances in Deep Learning than fields such as computer vision and natural language processing. This is, to a large extent, due to the lack of large-scale (as compared to computer vision) repositories of labeled training data for sensor-based HAR tasks. Thus, for example, ImageNet has images for around 100,000 categories (based on WordNet) with on average 1000 images per category (therefore up to 100,000,000 samples). The Kinetics-700 video activity data set has 650,000 video clips covering 700 different human activities (in total over 1800 h). By contrast, the total length of all sensor-based HAR data sets in the popular UCI machine learning repository is less than 63 h, with around 38 of those consisting of simple modes of locomotion such as walking, standing or cycling. In our research we aim to facilitate the use of online videos, which exist in ample quantities for most activities and are much easier to label than sensor data, to simulate labeled wearable motion sensor data. In previous work we already demonstrated some preliminary results in this direction, focusing on very simple, activity-specific simulation models and a single sensor modality (acceleration norm). In this paper, we show how we can train a regression model on generic motions for both accelerometer and gyro signals and then apply it to videos of the target activities to generate synthetic Inertial Measurement Unit (IMU) data (acceleration and gyro norms) that can be used to train and/or improve HAR models. We demonstrate that systems trained on simulated data generated by our regression model can come to within around 10% of the mean F1 score of a system trained on real sensor data. Furthermore, we show that by either including a small amount of real sensor data for model calibration or simply leveraging the fact that (in general) we can easily generate much more simulated data from video than we can collect its real version, the advantage of the latter can eventually be equalized.
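The acceleration and gyro norms mentioned above reduce each three-axis sensor sample to a single orientation-independent magnitude. A minimal sketch, assuming samples arrive as `(x, y, z)` tuples (the function names and data layout are illustrative assumptions):

```python
import math


def norm3(x, y, z):
    """Euclidean norm of one three-axis sample."""
    return math.sqrt(x * x + y * y + z * z)


def signal_norms(samples):
    """Map a sequence of (x, y, z) sensor samples to their norms."""
    return [norm3(x, y, z) for x, y, z in samples]
```

Because the norm discards direction, it does not depend on how the sensor is oriented on the body, which is one reason it is a convenient target for a regression model that only sees video.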

Mixture of Expert/Imitator Networks: Scalable Semi-Supervised Learning Framework

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014073 ◽

2019 ◽

Vol 33 ◽

pp. 4073-4081 ◽

Cited By ~ 1

Author(s):

Shun Kiyono ◽

Jun Suzuki ◽

Kentaro Inui

Keyword(s):

Language Processing ◽

Deep Neural Networks ◽

Semisupervised Learning ◽

Unlabeled Data ◽

Training Data ◽

Performance Property ◽

Learning Framework ◽

Classification Tasks ◽

Expert Network ◽

Label Distribution

The current success of deep neural networks (DNNs) in an increasingly broad range of tasks involving artificial intelligence strongly depends on the quality and quantity of labeled training data. In general, the scarcity of labeled data, which is often observed in many natural language processing tasks, is one of the most important issues to be addressed. Semisupervised learning (SSL) is a promising approach to overcoming this issue by incorporating a large amount of unlabeled data. In this paper, we propose a novel scalable method of SSL for text classification tasks. The unique property of our method, Mixture of Expert/Imitator Networks, is that imitator networks learn to “imitate” the estimated label distribution of the expert network over the unlabeled data, which potentially contributes a set of features for the classification. Our experiments demonstrate that the proposed method consistently improves the performance of several types of baseline DNNs. We also demonstrate that our method has the “more data, better performance” property, scaling promisingly with the amount of unlabeled data.
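One common way to make a network "imitate" another network's label distribution is to minimize the KL divergence between the two distributions on each unlabeled example. The sketch below illustrates that general idea only; the abstract does not state the exact objective used by the authors, so the choice of KL divergence and all names here are assumptions.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(expert_probs, imitator_probs, eps=1e-12):
    """KL(expert || imitator): zero when the imitator matches the expert exactly."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(expert_probs, imitator_probs)
        if p > 0
    )


def imitation_loss(expert_logits, imitator_logits):
    """Loss for one unlabeled example; no gold label is needed."""
    return kl_divergence(softmax(expert_logits), softmax(imitator_logits))
```

Since the target is the expert's predicted distribution rather than a human label, this loss can be evaluated on arbitrarily large amounts of unlabeled text.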

Deep Embedding Sentiment Analysis on Product Reviews Using Naive Bayesian Classifier

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit1952178 ◽

2019 ◽

pp. 858-864

Author(s):

Nukabathini Mary Saroj Sahithya ◽

Manda Prathyusha ◽

Nakkala Rachana ◽

Perikala Priyanka ◽

P. J. Jyothi

Keyword(s):

Deep Learning ◽

Language Processing ◽

Large Scale ◽

Opinion Mining ◽

Machine Learning Algorithms ◽

Sentiment Classification ◽

Training Data ◽

Fine Tuning ◽

Product Reviews ◽

Deep Embedding

Product reviews are valuable for prospective buyers in helping them make decisions. To this end, different opinion mining techniques have been proposed, where judging a review sentence's orientation (e.g. positive or negative) is one of the key challenges. Recently, deep learning has emerged as an effective means for solving sentiment classification problems. Deep learning is a class of machine learning algorithms that learn in supervised and unsupervised manners. A neural network intrinsically learns a useful representation automatically, without human effort. However, the success of deep learning relies heavily on large-scale training data. We propose a novel deep learning framework for product review sentiment classification which employs widely available ratings as supervision signals. The framework consists of two steps: (1) learning a high-level representation (an embedding space) which captures the general sentiment distribution of sentences through rating information; (2) adding a category layer on top of the embedding layer and using labelled sentences for supervised fine-tuning. We explore two kinds of low-level network structure for modelling review sentences, namely, convolutional feature extractors and long short-term memory. A convolutional layer is the core building block of a CNN and consists of kernels. Applications are image and video recognition, natural language processing, image classification
