Improved N-Best Extraction with an Evaluation on Language Data

Abstract We show that a previously proposed algorithm for the N-best trees problem can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared to the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. The algorithm is implemented in the software Betty, and compared to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets with respect to running time, while Tiburon seems to be the more memory-efficient choice.

Download Full-text

Learning emotional word embeddings for sentiment analysis

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201993 ◽

2021 ◽

pp. 1-13

Author(s):

Qingtian Zeng ◽

Xishi Zhao ◽

Xiaohui Hu ◽

Hua Duan ◽

Zhongying Zhao ◽

...

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

State Of The Art ◽

Research Problem ◽

Emotional Word ◽

Classification Model ◽

Data Sets ◽

Word Embeddings ◽

Real World Data ◽

Text Documents

Word embeddings have been successfully applied in many natural language processing tasks due to its their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve the above problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performances on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.

Download Full-text

GA2RM: A GA-Based Action Rule Mining Method

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026821500127 ◽

2021 ◽

pp. 2150012

Author(s):

Shervin Hashemi ◽

Pirooz Shamsinejad

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Search Space ◽

Data Sets ◽

Mining Method ◽

Classification Rules ◽

Classification Methods ◽

Rule Mining ◽

Current State ◽

Traditional Classification

Action Mining is a subfield of Data Mining that tries to extract actions from traditional data sets. Action Rule is a type of rule that suggests some changes in its consequent part. Extracting action rules from data has been one of the research interests in recent years. Current state-of-the-art action rule mining methods like DEAR typically take classification rules as their input; Since traditional classification methods have been designed for prediction and not for manipulation, therefore extracting action rules directly from data can result in more valuable action rules. Here, we have proposed a method to generate action rules directly from data. To tackle the problem of huge search space of action rules, a Genetic Algorithm has been devised. Different metrics have been defined for investigating the effectiveness of our proposed method and a large number of experiments have been done on real and synthetic data sets. The results show that our method can find from 20% to 10 times more interesting (in case of support and confidence) action rules in comparison with its competitors.

Download Full-text

Joint Character-Level Word Embedding and Adversarial Stability Training to Defend Adversarial Text

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6356 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8384-8391

Author(s):

Hui Liu ◽

Yongzheng Zhang ◽

Yipeng Wang ◽

Zheng Lin ◽

Yige Chen

Keyword(s):

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Word Embedding ◽

Data Sets ◽

Basic Task ◽

Gradient Based ◽

Adversarial Examples ◽

Stability Training ◽

Adversarial Example

Text classification is a basic task in natural language processing, but the small character perturbations in words can greatly decrease the effectiveness of text classification models, which is called character-level adversarial example attack. There are two main challenges in character-level adversarial examples defense, which are out-of-vocabulary words in word embedding model and the distribution difference between training and inference. Both of these two challenges make the character-level adversarial examples difficult to defend. In this paper, we propose a framework which jointly uses the character embedding and the adversarial stability training to overcome these two challenges. Our experimental results on five text classification data sets show that the models based on our framework can effectively defend character-level adversarial examples, and our models can defend 93.19% gradient-based adversarial examples and 94.83% natural adversarial examples, which outperforms the state-of-the-art defense models.

Download Full-text

We are not ready yet: limitations of transfer learning for Disease Named Entity Recognition

10.1101/2021.07.11.451939 ◽

2021 ◽

Author(s):

Lisa Langnickel ◽

Juliane Fluck

Keyword(s):

Transfer Learning ◽

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Clinical Applications ◽

Entity Recognition ◽

Data Sets ◽

Named Entity ◽

Current State ◽

Biomedical Named Entity Recognition

Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models are used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results - partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied on COVID-19 preprints shows a performance drop compared to the results on available test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize. Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods - thereof transfer learning - and show that current state-of-the-art methods work well for a given training and the corresponding test set but experience a significant lack of generalization when applying to new data. We therefore argue that there is a need for larger annotated data sets for training and testing.

Download Full-text

TrustSVD: A Novel Trust-Based Matrix Factorization Model with User Trust and Item Ratings

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i11.422 ◽

2017 ◽

Vol 7 (11) ◽

pp. 7 ◽

Cited By ~ 1

Author(s):

K Sobha Rani

Keyword(s):

Matrix Factorization ◽

Social Trust ◽

State Of The Art ◽

Data Sets ◽

Real World Data ◽

Recommendation Algorithm ◽

Active User ◽

Factorization Model ◽

The Social ◽

Matrix Factorization Technique

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm SVD++ which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach TrustSVD achieves better accuracy than other ten counterparts, and can better handle the concerned issues.

Download Full-text

A novel optimal multi-pattern matching method with wildcards for DNA sequence

Technology and Health Care ◽

10.3233/thc-218012 ◽

2021 ◽

Vol 29 ◽

pp. 115-124

Author(s):

Xinlu Wang ◽

Ahmed A.F. Saif ◽

Dayou Liu ◽

Yungang Zhu ◽

Jon Atli Benediktsson

Keyword(s):

Dna Sequence ◽

Pattern Matching ◽

Health Informatics ◽

State Of The Art ◽

Machine Language ◽

Data Sets ◽

Fundamental Issue ◽

Matching Method ◽

Dna Sequence Alignment ◽

The Given

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operation to identify which gene family may contain this sequence, pattern matching for DNA sequence has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi pattern matching with wildcards for DNA sequence. METHODS: This proposed method packs the patterns and a sliding window of texts, and the window slides along the given packed text, matching against stored packed patterns. RESULTS: Three data sets are used to test the performance of the proposed algorithm, and the algorithm was seen to be more efficient than the competitors because its operation is close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for the DNA sequence.

Download Full-text

gbt-HIPS: Explaining the Classifications of Gradient Boosted Tree Ensembles

Applied Sciences ◽

10.3390/app11062511 ◽

2021 ◽

Vol 11 (6) ◽

pp. 2511

Author(s):

Julian Hatwell ◽

Mohamed Medhat Gaber ◽

R. Muhammad Atif Azad

Keyword(s):

State Of The Art ◽

Heuristic Method ◽

Good Explanation ◽

Classification Rule ◽

Data Sets ◽

Classification Models ◽

Boundary Values ◽

Class Label ◽

Input Space ◽

Boosted Tree

This research presents Gradient Boosted Tree High Importance Path Snippets (gbt-HIPS), a novel, heuristic method for explaining gradient boosted tree (GBT) classification models by extracting a single classification rule (CR) from the ensemble of decision trees that make up the GBT model. This CR contains the most statistically important boundary values of the input space as antecedent terms. The CR represents a hyper-rectangle of the input space inside which the GBT model is, very reliably, classifying all instances with the same class label as the explanandum instance. In a benchmark test using nine data sets and five competing state-of-the-art methods, gbt-HIPS offered the best trade-off between coverage (0.16–0.75) and precision (0.85–0.98). Unlike competing methods, gbt-HIPS is also demonstrably guarded against under- and over-fitting. A further distinguishing feature of our method is that, unlike much prior work, our explanations also provide counterfactual detail in accordance with widely accepted recommendations for what makes a good explanation.

Download Full-text

Report on the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries at SIGIR 2019

ACM SIGIR Forum ◽

10.1145/3458553.3458554 ◽

2019 ◽

Vol 53 (2) ◽

pp. 3-10

Author(s):

Muthu Kumar Chandrasekaran ◽

Philipp Mayr

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Research And Development ◽

Language Processing ◽

Digital Libraries ◽

State Of The Art ◽

Shared Task ◽

Processing Information ◽

Joint Workshop

The 4 th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5 th edition of the CL-SciSumm Shared Task.

Download Full-text

A Hybrid Siamese Neural Network for Natural Language Inference in Cyber-Physical Systems

ACM Transactions on Internet Technology ◽

10.1145/3418208 ◽

2021 ◽

Vol 21 (2) ◽

pp. 1-25

Author(s):

Pin Ni ◽

Yuming Li ◽

Gangmin Li ◽

Victor Chang

Keyword(s):

Natural Language ◽

Language Processing ◽

Short Term Memory ◽

Physical World ◽

Heterogeneous Data ◽

Cyber Physical Systems ◽

Physical Systems ◽

Language Data ◽

Text Language ◽

Different Sources

Cyber-Physical Systems (CPS), as a multi-dimensional complex system that connects the physical world and the cyber world, has a strong demand for processing large amounts of heterogeneous data. These tasks also include Natural Language Inference (NLI) tasks based on text from different sources. However, the current research on natural language processing in CPS does not involve exploration in this field. Therefore, this study proposes a Siamese Network structure that combines Stacked Residual Long Short-Term Memory (bidirectional) with the Attention mechanism and Capsule Network for the NLI module in CPS, which is used to infer the relationship between text/language data from different sources. This model is mainly used to implement NLI tasks and conduct a detailed evaluation in three main NLI benchmarks as the basic semantic understanding module in CPS. Comparative experiments prove that the proposed method achieves competitive performance, has a certain generalization ability, and can balance the performance and the number of trained parameters.

Download Full-text

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art researches on NER are based on pre-trained language models (PLMs) or classic neural models. However, these researches are mainly oriented to high-resource languages such as English. While for Indonesian, related resources (both in dataset and technology) are not yet well-developed. Besides, affix is an important word composition for Indonesian language, indicating the essentiality of character and token features for token-wise Indonesian NLP tasks. However, features extracted by currently top-performance models are insufficient. Aiming at Indonesian NER task, in this paper, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA establishes competitive performance on IDNER and three benchmark datasets.

Download Full-text