Supervised Ensemble Learning for Vietnamese Tokenization

Vietnamese tokenization is a challenging fundamental problem, and tokenization algorithms are used in many natural language processing applications. In this paper, we investigate the Vietnamese tokenization problem and propose a supervised ensemble learning (SEL) framework as well as an SEL-based tokenization (SELT) algorithm. Supported by a syllable-syllable frequency index data structure, the SELT algorithm combines multiple weak tokenizers to form a strong tokenizer. Within the SEL framework, we also investigate how to construct a weak tokenizer efficiently. We suggest two prediction methods for selecting a suitable dictionary, and we efficiently implement two weak tokenizers using a simple dictionary-based tokenization algorithm. The experimental results show that the SELT algorithm integrating our weak tokenizers achieves state-of-the-art performance on the Vietnamese tokenization task.
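The abstract does not spell out the weak tokenizers or how SELT combines them, so the following is only a minimal sketch of the general idea. It assumes that a weak tokenizer is a greedy longest-match lookup over syllables in a dictionary and that the combination step is an unweighted majority vote on token boundaries. That vote is a stand-in for the supervised combination and frequency index described above, and all function names are hypothetical.

```python
from collections import Counter


def longest_match_tokenize(syllables, dictionary, max_len=4):
    """Greedy left-to-right longest matching over a list of syllables."""
    tokens, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable always matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


def boundaries(tokens):
    """Syllable positions after which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok.split(" "))
        ends.add(pos)
    return ends


def ensemble_tokenize(syllables, dictionaries):
    """Assumed combination rule: keep a boundary if most weak tokenizers agree."""
    votes = Counter()
    for d in dictionaries:
        votes.update(boundaries(longest_match_tokenize(syllables, d)))
    needed = len(dictionaries) / 2
    tokens, start = [], 0
    for pos in range(1, len(syllables) + 1):
        if votes[pos] > needed or pos == len(syllables):
            tokens.append(" ".join(syllables[start:pos]))
            start = pos
    return tokens


if __name__ == "__main__":
    syllables = "a b c d".split()
    dicts = [{"a b", "c d"}, {"a b"}, {"b c"}]
    print(ensemble_tokenize(syllables, dicts))
```

Each dictionary yields one weak tokenizer, so choosing the dictionaries (the role of the two prediction methods mentioned in the abstract) determines how much the voters disagree.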

An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation

JMIR Medical Informatics ◽

10.2196/17832 ◽

2020 ◽

Vol 8 (7) ◽

pp. e17832

Author(s):

Kun Zeng ◽

Zhiwei Pan ◽

Yibin Xu ◽

Yingying Qu

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Natural Language Processing ◽

Ensemble Learning ◽

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Shared Task ◽

Eligibility Criteria ◽

Short Text

Background: Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. Objective: We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. Methods: We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. Results: Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. Conclusions: We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.
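Three ingredients of this abstract can be illustrated in a few lines of plain Python: stacking base-model outputs into features for a second-stage learner, the focal loss, and the macro F1 metric. This is a sketch under assumptions, not the authors' pipeline. The base models and LightGBM are left out, the probability vectors are made-up inputs, the focal loss is shown in its basic form without a class-weighting term, and the function names are hypothetical.

```python
import math


def stack_features(prob_vectors):
    """Concatenate per-model class-probability vectors into one feature row."""
    return [p for probs in prob_vectors for p in probs]


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example, given the probability of the true class.

    With gamma = 0 this reduces to ordinary cross-entropy; larger gamma
    down-weights examples the model already classifies confidently.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative probabilities from three hypothetical base models.
    row = stack_features([[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]])
    print(row)
    print(focal_loss(0.9), focal_loss(0.1))
```

Because macro F1 averages over classes rather than examples, rare criteria categories count as much as frequent ones, which is why class imbalance matters for this metric.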

An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation (Preprint)

10.2196/preprints.17832 ◽

2020 ◽

Author(s):

Kun Zeng ◽

Zhiwei Pan ◽

Yibin Xu ◽

Yingying Qu

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Natural Language Processing ◽

Ensemble Learning ◽

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Shared Task ◽

Eligibility Criteria ◽

Short Text

BACKGROUND: Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. OBJECTIVE: We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. METHODS: We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. RESULTS: Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. CONCLUSIONS: We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

10.1101/472423 ◽

2018 ◽

Author(s):

Alan Kuhnle ◽

Taher Mun ◽

Christina Boucher ◽

Travis Gagie ◽

Ben Langmead ◽

...

Keyword(s):

Data Structure ◽

State Of The Art ◽

Suffix Array ◽

Genomic Databases ◽

Run Length ◽

Slowing Down ◽

Human Genomes ◽

Efficient Construction ◽

Main Components ◽

Burrows Wheeler Transform

While short read aligners, which predominantly use the FM-index, can easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means of keeping the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time. Availability: The implementation of our methods can be found at https://github.com/alshai/r-index.
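To make the terms concrete, here is a small, deliberately naive sketch of a suffix array, the BWT derived from it, and run-length encoding of the BWT. It is a textbook illustration only and is unrelated to the authors' construction algorithm or the r-index code. Sorting whole suffixes, as done here, takes time and memory far beyond what genomic databases allow. The function names are hypothetical, and the input is assumed to end with a `$` sentinel that sorts before every other character.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """The BWT lists, for each sorted suffix, the character that precedes it.

    Index -1 wraps around, so the suffix starting at 0 maps to the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


def run_length_encode(s):
    """Collapse consecutive equal characters into (character, count) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    bwt = bwt_from_sa(text, sa)
    print(sa)                       # [6, 5, 3, 1, 0, 4, 2]
    print(bwt)                      # annb$aa
    print(run_length_encode(bwt))   # 5 runs for 7 characters
```

On highly repetitive inputs, such as many genomes of the same species, the number of runs grows much more slowly than the text length, which is what makes the run-length compressed BWT small.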

Report on the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries at SIGIR 2019

ACM SIGIR Forum ◽

10.1145/3458553.3458554 ◽

2019 ◽

Vol 53 (2) ◽

pp. 3-10

Author(s):

Muthu Kumar Chandrasekaran ◽

Philipp Mayr

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Research And Development ◽

Language Processing ◽

Digital Libraries ◽

State Of The Art ◽

Shared Task ◽

Processing Information ◽

Joint Workshop

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.

Learning emotional word embeddings for sentiment analysis

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201993 ◽

2021 ◽

pp. 1-13

Author(s):

Qingtian Zeng ◽

Xishi Zhao ◽

Xiaohui Hu ◽

Hua Duan ◽

Zhongying Zhao ◽

...

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

State Of The Art ◽

Research Problem ◽

Emotional Word ◽

Classification Model ◽

Data Sets ◽

Word Embeddings ◽

Real World Data ◽

Text Documents

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis. The method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
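The abstract does not say which two linear weighting methods are used, so the sketch below only illustrates what a linear weighting of word vectors into a document vector looks like. It assumes a uniform average as one option and externally supplied per-word weights (for example, inverse document frequencies) as the other. The vectors, weights, and function name are all hypothetical.

```python
def weighted_doc_vector(tokens, vectors, weights=None):
    """Weighted average of the word vectors of the tokens that have a vector.

    With weights=None every word counts equally; otherwise a word's weight is
    looked up in `weights`, defaulting to 1.0 for words without an entry.
    """
    known = [t for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    weight_sum = 0.0
    for t in known:
        w = 1.0 if weights is None else weights.get(t, 1.0)
        weight_sum += w
        for k in range(dim):
            total[k] += w * vectors[t][k]
    return [x / weight_sum for x in total]


if __name__ == "__main__":
    # Toy two-dimensional vectors, chosen only for illustration.
    vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
    tokens = ["good", "movie"]
    print(weighted_doc_vector(tokens, vectors))
    print(weighted_doc_vector(tokens, vectors, {"good": 3.0, "movie": 1.0}))
```

Because the document vector is a linear function of the word vectors, a gradient computed on the document vector during classifier training can flow back to each word vector in proportion to its weight, which is one way sentiment labels could shape the embeddings.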

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. In addition, affixes are an important part of word formation in Indonesian, which indicates the importance of character and token features for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. Targeting the Indonesian NER task, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.

Suffix array for multi-pattern matching with variable length wildcards

Intelligent Data Analysis ◽

10.3233/ida-205087 ◽

2021 ◽

Vol 25 (2) ◽

pp. 283-303

Author(s):

Na Liu ◽

Fei Xie ◽

Xindong Wu

Keyword(s):

Dynamic Programming ◽

Data Structure ◽

Pattern Matching ◽

Edit Distance ◽

State Of The Art ◽

Suffix Array ◽

Variable Length ◽

Distance Method ◽

Efficient Data ◽

Comparison Algorithms

Approximate multi-pattern matching is an important problem with wide and frequent application when the patterns contain variable-length wildcards. In this paper, two suffix array-based algorithms are proposed to solve this problem. The suffix array is an efficient data structure that existing studies have used for exact string matching as well as for approximate pattern matching and multi-pattern matching. One algorithm, MMSA-S, handles short exact character sequences in a pattern using dynamic programming, while the other, MMSA-L, handles long exact character sequences using the edit distance method. Experimental results on the Pizza & Chili corpus demonstrate that the two proposed algorithms are, in most cases, more time-efficient than the state-of-the-art comparison algorithms.
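As background for the setting, the following sketch shows exact lookup of a pattern segment in a suffix array by binary search, and how two segment hit lists can be combined under a variable-length gap constraint. It is not MMSA-S or MMSA-L, since it performs no approximate matching and no dynamic programming. The function names and the `min_gap`/`max_gap` parameters are assumptions made for illustration.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Start positions of `pattern` in `text`, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def match_with_gap(text, sa, left, right, min_gap, max_gap):
    """Pairs (i, j): `left` starts at i, `right` starts at j, and the number of
    wildcard characters between them lies in [min_gap, max_gap]."""
    rights = find_occurrences(text, sa, right)
    pairs = []
    for i in find_occurrences(text, sa, left):
        end = i + len(left)
        pairs.extend((i, j) for j in rights if min_gap <= j - end <= max_gap)
    return pairs


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "ana"))        # [1, 3]
    print(match_with_gap(text, sa, "b", "n", 1, 3))  # [(0, 2), (0, 4)]
```

Splitting a wildcard pattern into its exact segments and looking each one up independently is the natural way to reuse a suffix array here; the cost then depends on how many hits each segment produces, which is presumably why short and long segments are treated differently.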

Cache-efficient sweeping-based interval joins for extended Allen relation predicates

The VLDB Journal ◽

10.1007/s00778-020-00650-5 ◽

2021 ◽

Author(s):

Danila Piatov ◽

Sven Helmer ◽

Anton Dignös ◽

Fabio Persia

Keyword(s):

Data Structure ◽

Experimental Evaluation ◽

State Of The Art ◽

Temporal Databases ◽

Access Method ◽

Wide Range ◽

Interval Relation ◽

Cache Efficient ◽

Join Algorithms ◽

Better Than

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen's relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
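A plane-sweep join is easiest to see for the simplest predicate, plain overlap. The sketch below assumes half-open, non-empty intervals `(start, end)` and keeps the currently active intervals of each relation in ordinary Python dictionaries. It does not use the Timeline Index or the gapless hash map from the paper, and it does not cover the other Allen relations. The function name is hypothetical.

```python
def overlap_join(r, s):
    """Index pairs (i, j) such that r[i] and s[j] overlap.

    Intervals are half-open (start, end) with start < end. Start points are
    swept in increasing order; each pair is reported once, when the later of
    its two intervals starts.
    """
    events = sorted(
        [(start, 0, i) for i, (start, _) in enumerate(r)]
        + [(start, 1, j) for j, (start, _) in enumerate(s)]
    )
    active = ({}, {})  # per relation: interval index -> end point
    out = []
    for start, side, idx in events:
        other = active[1 - side]
        # Intervals of the other relation that ended at or before this start
        # can never overlap anything seen from here on.
        for k in [k for k, end in other.items() if end <= start]:
            del other[k]
        for k in other:
            out.append((idx, k) if side == 0 else (k, idx))
        end = r[idx][1] if side == 0 else s[idx][1]
        active[side][idx] = end
    return sorted(out)


if __name__ == "__main__":
    r = [(1, 5), (6, 9)]
    s = [(4, 7), (9, 10)]
    print(overlap_join(r, s))  # [(0, 0), (1, 0)]
```

The inner loops repeatedly scan the active set, so how that set is laid out in memory largely determines the speed of the sweep; this is the part the abstract's compact, cache-friendly data structure is aimed at.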

More on Brain and Semiosis: Can We Find a Point in Neuronets?

Вопросы философии ◽

10.21146/0042-8744-2021-6-5-13 ◽

2021 ◽

pp. 5-13

Author(s):

Tatiana V. Chernigovskaya

Keyword(s):

Language Processing ◽

Intelligent Systems ◽

State Of The Art ◽

Similar Data ◽

Direct Testing ◽

The Subject ◽

Neurophysiological Basis ◽

Person Experience ◽

Cognitive Dimensions ◽

Human Specific

The paper discusses semiotic aspects of higher human functions and the possibility and relevance of the traditional search for their neurophysiological basis. The state of the art on the subject is reviewed, and the lack of data on anthropological specificity for reasoning, thinking, language and its AI modeling is highlighted. Experimental neuroscience presumes that if we know the characteristics of neurons and their connections, we automatically understand what mind and consciousness are. However, it is evident that such a paradigm does not allow us to get relevant answers to the main questions. I argue that the problem should be dealt with not only within the field of neurophysiology proper. Rather, such research should involve exploring the 'archeology' of mental processes as they are revealed in the arts as well as in other symbolic spaces. The paper discusses the adequacy of physiological methodology when it is employed to demonstrate brain mechanisms of higher functions. In addition, I explore the relevance of juxtaposing similar data from other biological and artificial intelligent systems. I view language processing, mind and reasoning, and first-person experience (qualia) as human-specific features, and I question the possibility of directly testing these phenomena. The paper links genetic, anthropological and neurophysiological data to semiotic activity and semiosphere formation as the basis for communication. The paper discusses the place of humans in the changing world in the context of new cognitive dimensions.

Psycholinguistics

10.1093/oxfordhb/9780198736745.013.17 ◽

2018 ◽

Author(s):

Pouneh Shabani-Jadidi

Keyword(s):

Language Processing ◽

State Of The Art ◽

Mental Lexicon ◽

Linguistic Analysis ◽

Language Impairments ◽

Growing Body ◽

Research Tools

Psycholinguistics encompasses the psychology of language as well as linguistic psychology. Although they might sound similar, they are distinct. The former is a branch of linguistics, while the latter is a subdivision of psychology. In the psychology of language, the means are the research tools adopted from psychology and the end is the study of language. In linguistic psychology, however, the means are the data derived from linguistic studies and the end is psychology. This chapter focuses on the first of these two components, that is, the psychology of language. The goal of this chapter is to give a state-of-the-art perspective on the small but growing body of research using psycholinguistic tools to study Persian, with a focus on two areas: presenting longstanding debates about the mental lexicon, language impairments and language processing; and introducing a source of data for the linguistic analysis of Persian.
