Supervised Ensemble Learning for Vietnamese Tokenization

Author(s):  
Wuying Liu

Vietnamese tokenization is a challenging basic issue, and the corresponding algorithms can be used in many applications of natural language processing. In this paper, we investigate the Vietnamese tokenization problem and propose a supervised ensemble learning (SEL) framework as well as a SEL-based tokenization (SELT) algorithm. Supported by the data structure of syllable-syllable frequency index, the SELT algorithm combines multiple weak tokenizers to form a strong tokenizer. Within the SEL framework, we also investigate the efficient construction problem of a weak tokenizer. We suggest two prediction methods to select a suitable dictionary, and efficiently implement two weak tokenizers by the simple dictionary-based tokenization algorithm. The experimental results show that the SELT algorithm integrating our weak tokenizers can achieve state-of-the-art performance in the Vietnamese tokenization task.
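The simple dictionary-based tokenization mentioned above can be illustrated with a minimal sketch. It assumes greedy forward longest matching over space-separated syllables; the abstract does not specify the matching strategy, and the function name, the `max_len` limit, and the toy dictionary are hypothetical. The ensemble step and the syllable-syllable frequency index are not modeled here.

```python
def longest_match_tokenize(syllables, dictionary, max_len=4):
    """Greedy forward longest matching over a list of syllables.

    `dictionary` holds multi-syllable words written with single spaces.
    Any syllable that starts no dictionary word becomes its own token.
    """
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"sinh viên", "đại học"}
sentence = "tôi là sinh viên đại học"
tokens = longest_match_tokenize(sentence.split(), toy_dictionary)
print(tokens)
```

A weak tokenizer of this kind depends entirely on which dictionary it is given, which is why selecting a suitable dictionary matters within the framework.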

10.2196/17832 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e17832
Author(s):  
Kun Zeng ◽  
Zhiwei Pan ◽  
Yibin Xu ◽  
Yingying Qu

Background Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. Objective We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. Methods We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. Results Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. Conclusions We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.
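The abstract does not give the loss formula, so the following is only a sketch of the commonly used focal loss for a single example, in which the cross-entropy term is scaled by (1 - p)^gamma. The default values `gamma=2.0` and `alpha=1.0` are assumptions for illustration, not settings reported by the authors.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    p_true: predicted probability of the correct class, in (0, 1].
    gamma:  focusing parameter; gamma = 0 gives plain cross-entropy.
    alpha:  class weighting factor (assumed constant here).
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# A confidently correct example contributes far less than a hard one,
# so frequent, easy classes dominate the total loss less.
easy = focal_loss(0.9)
hard = focal_loss(0.1)
print(easy, hard)
```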


2018 ◽  
Author(s):  
Alan Kuhnle ◽  
Taher Mun ◽  
Christina Boucher ◽  
Travis Gagie ◽  
Ben Langmead ◽  
...  

Abstract: While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string’s suffix array (SA) containing pointers to starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means of keeping the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.’s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time. Availability: The implementation of our methods can be found at https://github.com/alshai/r-index.
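To make the role of run-length compression concrete, here is a small sketch that builds a BWT naively by sorting rotations and then groups it into runs. This is a textbook construction for tiny inputs, not the authors' construction method, and it assumes the text does not contain the sentinel character `$`.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (toy inputs only)."""
    s = text + "$"  # sentinel, assumed absent from the text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_lengths(s):
    """Collapse a string into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, n) for ch, n in runs]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(run_lengths(transformed))  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```

On highly repetitive collections the BWT tends to contain long runs of equal characters, so the number of runs, rather than the text length, governs the size of the compressed structure.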


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.


2021 ◽  
pp. 1-12
Author(s):  
Yingwen Fu ◽  
Nankai Lin ◽  
Xiaotian Lin ◽  
Shengyi Jiang

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. Moreover, affixes are an important part of word formation in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. To address the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.


2021 ◽  
Vol 25 (2) ◽  
pp. 283-303
Author(s):  
Na Liu ◽  
Fei Xie ◽  
Xindong Wu

Approximate multi-pattern matching is an important and widely used problem, particularly when the pattern contains variable-length wildcards. In this paper, two suffix array-based algorithms are proposed to solve this problem. The suffix array is an efficient data structure that existing studies have used for exact string matching as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles the short exact characters in a pattern by dynamic programming, while the second, MMSA-L, handles the long exact characters by the edit distance method. Experimental results on the Pizza & Chili corpus demonstrate that these two newly proposed algorithms are, in most cases, more time-efficient than the state-of-the-art comparison algorithms.
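As background for the suffix array's role, the sketch below shows only the basic building block: exact matching of a single pattern by binary search over a naively built suffix array. It is not MMSA-S or MMSA-L, and it handles neither wildcards nor approximate matches; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of `pattern` using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


text = "banana"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "ana"))  # [1, 3]
```

All suffixes that begin with the pattern are contiguous in the suffix array, which is what makes the two binary searches sufficient.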


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

Abstract: We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
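The general plane-sweep idea can be sketched for the simplest predicate, interval overlap. This is a generic endpoint-based sweep, not the authors' framework: it does not use the Timeline Index or the gapless hash map, and it assumes half-open intervals [start, end) with start < end.

```python
def overlap_join(left, right):
    """Return index pairs (i, j) where left[i] and right[j] overlap.

    Intervals are half-open (start, end) tuples with start < end.
    """
    events = []
    for side, relation in enumerate((left, right)):
        for idx, (start, end) in enumerate(relation):
            events.append((start, 1, side, idx))  # 1 = interval starts
            events.append((end, 0, side, idx))    # 0 = interval ends

    # At equal timestamps, ends (0) sort before starts (1), so intervals
    # that merely touch are not reported as overlapping.
    events.sort()

    active = (set(), set())  # currently open intervals per relation
    result = []
    for _, kind, side, idx in events:
        if kind == 0:
            active[side].discard(idx)
        else:
            # A starting interval overlaps every open interval of the
            # other relation.
            for other in active[1 - side]:
                result.append((idx, other) if side == 0 else (other, idx))
            active[side].add(idx)
    return sorted(result)


left = [(1, 5), (6, 8)]
right = [(4, 7), (8, 9)]
print(overlap_join(left, right))  # [(0, 0), (1, 0)]
```

Each interval is compared only against intervals that are open when it starts, which avoids the quadratic cost of testing every pair.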


Author(s):  
Tatiana V. Chernigovskaya ◽  

The paper discusses semiotic aspects of higher human functions and the possibility and relevance of the traditional search for their neurophysiological basis. The state of the art on the subject is reviewed and the lack of data on anthropological specificity for reasoning, thinking, language and its AI modeling is highlighted. Experimental neuroscience presumes that if we know the characteristics of neurons and their connections, we automatically understand what mind and consciousness are. However, it is evident that such a paradigm does not allow us to get relevant answers to the main questions. I argue that the problem should be dealt with not only within the field of neurophysiology proper. Rather, such research should involve exploring the 'archeology' of mental processes as they are revealed in arts as well as in other symbolic spaces. The paper discusses the adequacy of physiological methodology when it is employed to demonstrate brain mechanisms of higher functions. In addition, I explore the relevance of juxtaposing similar data from other biological and artificial intelligent systems. I view language processing, mind and reasoning and first-person experience (qualia) as human-specific features, and question the possibility of directly testing these phenomena. The paper links genetic, anthropological and neurophysiological data to semiotic activity and semiosphere formation as the basis for communication. The paper discusses the place of humans in the changing world in the context of new cognitive dimensions.


Author(s):  
Pouneh Shabani-Jadidi

Psycholinguistics encompasses the psychology of language as well as linguistic psychology. Although they might sound similar, they are actually distinct. The first is a branch of linguistics, while the latter is a subdivision of psychology. In the psychology of language, the means are the research tools adopted from psychology and the end is the study of language. However, in linguistic psychology, the means are the data derived from linguistic studies and the end is psychology. This chapter focuses on the first of these two components; that is, the psychology of language. The goal of this chapter is to give a state-of-the-art perspective on the small but growing body of research using psycholinguistic tools to study Persian with a focus on two areas: presenting longstanding debates about the mental lexicon, language impairments and language processing; and introducing a source of data for the linguistic analysis of Persian.

