A Bag of Concepts Approach for Biomedical Document Classification Using Wikipedia Knowledge

2017 ◽  
Vol 56 (05) ◽  
pp. 370-376 ◽  
Author(s):  
Roberto Pérez-Rodríguez ◽  
Luis E. Anido-Rifón ◽  
Marcos A. Mouriño-García

Summary. Objectives: The ability to efficiently review the existing literature is essential for the rapid progress of research. This paper describes a classifier of text documents, represented as vectors in spaces of Wikipedia concepts, and analyses its suitability for classification of Spanish biomedical documents when only English documents are available for training. We propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia interlanguage links to convert concept vectors from the Spanish to the English space. Methods: The performance of the classifier is compared to several baselines: a classifier based on machine translation, a classifier that represents documents after performing Explicit Semantic Analysis (ESA), and a classifier that uses a domain-specific semantic annotator (MetaMap). The corpus used for the experiments (Cross-Language UVigoMED) was purpose-built for this study, and it is composed of 12,832 English and 2,184 Spanish MEDLINE abstracts. Results: The performance of our approach is superior to that of any other state-of-the-art classifier in the benchmark, with performance increases of up to 124% over classical machine translation, 332% over MetaMap, and 60 times over the classifier based on ESA. The results are statistically significant, with p-values < 0.0001. Conclusion: Using knowledge mined from Wikipedia to represent documents as vectors in a space of Wikipedia concepts and translating vectors between language-specific concept spaces, a cross-language classifier can be built, and it performs better than several state-of-the-art classifiers.
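
The cross-language concept matching step can be pictured as a remapping of the keys of a concept vector. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `links_es_en` dictionary, the concept titles and the weights are hypothetical toy data, and dropping concepts that have no interlanguage link is an assumption the abstract does not confirm.

```python
from collections import defaultdict


def translate_concept_vector(vector, interlanguage_links):
    """Map a {concept: weight} vector from a source concept space to a target one.

    Concepts without an interlanguage link are dropped (an assumption);
    weights of source concepts that map to the same target concept are summed.
    """
    translated = defaultdict(float)
    for concept, weight in vector.items():
        target = interlanguage_links.get(concept)
        if target is not None:
            translated[target] += weight
    return dict(translated)


# Hypothetical toy data: Spanish Wikipedia titles mapped to English titles.
links_es_en = {"Corazón": "Heart", "Riñón": "Kidney"}
spanish_vector = {"Corazón": 0.7, "Riñón": 0.2, "Concepto sin enlace": 0.1}

english_vector = translate_concept_vector(spanish_vector, links_es_en)
print(english_vector)
```

Once a Spanish document is expressed in the English concept space this way, it could in principle be passed to a classifier trained only on English concept vectors.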

Author(s):  
Junaid Rashid ◽  
Syed Muhammad Adnan Shah ◽  
Aun Irtaza

Topic modeling is an effective text mining and information retrieval approach to organizing knowledge with various contents under a specific topic. Text documents in the form of news articles are growing very fast on the web. Analysis of these documents is very important in the fields of text mining and information retrieval, and extracting meaningful information from them is a challenging task. One approach for discovering the themes of text documents is topic modeling, but this approach still needs a new perspective to improve its performance. In topic modeling, documents have topics and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach that uses the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters 21578 and BBC News, show that KTM performs better than state-of-the-art topic models such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining, where it achieves higher performance than its competitors LDA and LSA.


Author(s):  
Elangovan Ramanujam ◽  
S. Padmavathi

Innovations in and the applicability of time series data mining techniques have significantly increased researchers' interest in the problem of time series classification. Several algorithms have been proposed for this purpose, categorized as shapelet, interval, motif, and whole series-based techniques. Among these, the bag-of-words technique, an extensive application of the text mining approach, performs well due to its simplicity and effectiveness. To extend the efficiency of the bag-of-words technique, this paper proposes a discriminative supervised weighting scheme to identify the characteristic and representative patterns of a class for efficient classification. This paper uses a modified weighted matrix that discriminates between representative and non-representative patterns, which enables interpretability in classification. Experiments have been carried out to compare the performance of the proposed technique with state-of-the-art techniques in terms of accuracy and statistical significance.


Author(s):  
Josef Steinberger ◽  
Ralf Steinberger ◽  
Hristo Tanev ◽  
Vanni Zavarella ◽  
Marco Turchi

In this chapter, the authors discuss several pertinent aspects of an automatic system that generates summaries in multiple languages for sets of topic-related news articles (multilingual multi-document summarisation), gathered by news aggregation systems. The discussion follows a framework based on Latent Semantic Analysis (LSA) because LSA was shown to be a high-performing method across many different languages. Starting from a sentence-extractive approach, the authors show how domain-specific aspects can be used and how a compression and paraphrasing method can be plugged in. They also discuss the challenging problem of summarisation evaluation in different languages. In particular, the authors describe two approaches: the first uses a parallel corpus and the second statistical machine translation.


Author(s):  
Scott Blunsden ◽  
Robert Fisher

This chapter presents a way to classify interactions between people. Examples of the interactions we investigate are: people meeting one another, walking together, and fighting. A new feature set is proposed along with a corresponding classification method. Results are presented which show the new method performing significantly better than the previous state-of-the-art method proposed by Oliver et al. (2000).


Author(s):  
Zilu Guo ◽  
Zhongqiang Huang ◽  
Kenny Q. Zhu ◽  
Guandan Chen ◽  
Kaibo Zhang ◽  
...  

Paraphrase generation plays key roles in NLP tasks such as question answering, machine translation, and information retrieval. In this paper, we propose a novel framework for paraphrase generation. It simultaneously decodes the output sentence using a pretrained wordset-to-sequence model and a round-trip translation model. We evaluate this framework on Quora, WikiAnswers, MSCOCO and Twitter, and show its advantage over previous state-of-the-art unsupervised methods and distantly-supervised methods by significant margins on all datasets. For Quora and WikiAnswers, our framework even performs better than some strongly supervised methods with domain adaptation. Further, we show that the generated paraphrases can be used to augment the training data for machine translation to achieve substantial improvements.


1996 ◽  
Vol 05 (04) ◽  
pp. 367-401
Author(s):  
CHAI KIAT YEO ◽  
WAI KONG LAM ◽  
ING YANN SOON

A new approach to machine translation, capable of resolving different meanings of a verb in sentences of varying context, is described. The design revolves around the Verb Usage Frame (VUF) and the Noun Classification Hierarchy (NCH). VUF contains different context items which embody the different semantic usages of a verb under different contexts. The meaning of the verb is resolved through the classifications of its subject and object, achieved through the NCH. NCH returns not just the basic classification of a noun but also its super-classification. This allows thorough semantic analysis of both the verb and the noun. The entire design is implemented using object-oriented techniques and a prototype English-Japanese machine translator is built to illustrate the merits of the design.
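
The interplay between the Verb Usage Frame and the Noun Classification Hierarchy described above can be illustrated with a toy lookup. This is only a sketch under stated assumptions: the hierarchy, the frames, the sense labels and the function names are all hypothetical, and for brevity it consults only the subject of the verb, whereas the abstract says both subject and object are classified.

```python
# Hypothetical noun hierarchy: each noun or class points to its parent class.
NOUN_PARENT = {
    "teacher": "human",
    "human": "animate",
    "dog": "animal",
    "animal": "animate",
    "engine": "machine",
    "machine": "inanimate",
}

# Hypothetical usage frames: for each verb, (required subject class, sense) pairs.
VERB_FRAMES = {
    "run": [("animate", "move quickly on foot"), ("machine", "operate")],
}


def classifications(noun):
    """Return the basic classification of a noun followed by its super-classifications."""
    chain = []
    current = NOUN_PARENT.get(noun)
    while current is not None:
        chain.append(current)
        current = NOUN_PARENT.get(current)
    return chain


def resolve_verb(verb, subject):
    """Pick the first sense whose required class appears in the subject's class chain."""
    classes = classifications(subject)
    for required_class, sense in VERB_FRAMES.get(verb, []):
        if required_class in classes:
            return sense
    return None


print(resolve_verb("run", "teacher"))
print(resolve_verb("run", "engine"))
```

Returning the super-classifications as well as the basic class is what lets a frame written for a broad class such as `animate` match a specific noun such as `teacher`.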


Author(s):  
Li Rui ◽  
Zheng Shunyi ◽  
Duan Chenxi ◽  
Yang Yang ◽  
Wang Xiqi

In recent years, Hyperspectral Image (HSI) classification has attracted growing attention from researchers. It is important to study how to use the rich spectral and spatial information of HSIs to its fullest potential. To capture spectral and spatial features, we propose a Double-Branch Dual-Attention mechanism network (DBDA) for HSI classification in this paper. Two branches are designed to extract spectral and spatial features separately to reduce the interference between these two kinds of features. Moreover, because the two branches have distinct characteristics, a different type of attention mechanism is applied in each branch, so that spectral and spatial features are exploited more discriminatively. Finally, the extracted features are fused for classification. A series of empirical studies have been conducted on four hyperspectral datasets, and the results show that the proposed method performs better than the state-of-the-art method.


2016 ◽  
Vol 28 (2) ◽  
pp. 257-285 ◽  
Author(s):  
Sarath Chandar ◽  
Mitesh M. Khapra ◽  
Hugo Larochelle ◽  
Balaraman Ravindran

Common representation learning (CRL), wherein different descriptions (or views) of the data are embedded in a common subspace, has been receiving a lot of attention recently. Two popular paradigms here are canonical correlation analysis (CCA)–based approaches and autoencoder (AE)–based approaches. CCA-based approaches learn a joint representation by maximizing correlation of the views when projected to the common subspace. AE-based methods learn a common representation by minimizing the error of reconstructing the two views. Each of these approaches has its own advantages and disadvantages. For example, while CCA-based approaches outperform AE-based approaches for the task of transfer learning, they are not as scalable as the latter. In this work, we propose an AE-based approach, correlational neural network (CorrNet), that explicitly maximizes correlation among the views when projected to the common subspace. Through a series of experiments, we demonstrate that the proposed CorrNet is better than AE and CCA with respect to its ability to learn correlated common representations. We employ CorrNet for several cross-language tasks and show that the representations learned using it perform better than the ones learned using other state-of-the-art approaches.
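
The quantity that CorrNet is said to maximize, the correlation between the two views after projection to the common subspace, can be illustrated for a single hidden dimension. The abstract does not give the training objective, so the sketch below only computes a Pearson correlation between two hypothetical one-dimensional projections; the variable names and values are assumptions made for illustration.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical projections of the same three samples from two views.
hidden_from_view1 = [1.0, 2.0, 3.0]
hidden_from_view2 = [2.0, 4.0, 6.0]
print(pearson_correlation(hidden_from_view1, hidden_from_view2))
```

A value close to 1 means the two views agree in the common subspace; a training procedure along the lines described would push this value up while keeping reconstruction error low.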


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Zina Z. R. Al-Shamaa ◽  
Sefer Kurnaz ◽  
Adil Deniz Duru ◽  
Nadia Peppa ◽  
Alex H. Mirnezami ◽  
...  

Imbalanced class distribution in medical datasets is a challenging problem that hinders correct disease classification. It arises when the number of healthy class instances is much larger than the number of disease class instances. To solve this problem, we propose undersampling the healthy class instances to improve disease class classification. This model is named Hellinger Distance Undersampling (HDUS). It employs the Hellinger distance to measure the resemblance between a majority class instance and its neighbouring minority class instances to separate classes effectively and boost the discrimination power for each class. An extensive experiment has been conducted on four imbalanced medical datasets using three classifiers to compare HDUS with a baseline model and three state-of-the-art undersampling models. The results show that HDUS can perform better than other models in terms of sensitivity, F1 measure, and balanced accuracy.
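
For reference, the Hellinger distance between two discrete probability distributions can be computed as in the sketch below. This is the standard textbook definition only; how HDUS derives distributions from individual instances and their neighbours is not stated in the abstract, so the function name and the example inputs are assumptions.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions given as equal-length lists.

    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared_sum = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_sum) / math.sqrt(2)


# Hypothetical example distributions over three bins.
same = hellinger_distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
disjoint = hellinger_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
print(same, disjoint)
```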


Plants ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1319
Author(s):  
Muhammad Hammad Saleem ◽  
Johan Potgieter ◽  
Khalid Mahmood Arif

Recently, plant disease classification has been performed by various state-of-the-art deep learning (DL) architectures on publicly available and author-generated datasets. This research proposes a deep learning-based comparative evaluation for the classification of plant disease in two steps. First, the best convolutional neural network (CNN) was obtained by conducting a comparative analysis among well-known CNN architectures along with modified and cascaded/hybrid versions of some of the DL models proposed in recent research. Second, an attempt was made to improve the performance of the best model by training it with various deep learning optimizers. The comparison between the CNNs was based on performance metrics such as validation accuracy/loss, F1-score, and the required number of epochs. All the selected DL architectures were trained on the PlantVillage dataset, which contains 26 different diseases belonging to 14 plant species. Keras with a TensorFlow backend was used to train the deep learning architectures. It is concluded that the Xception architecture trained with the Adam optimizer attained the highest validation accuracy and F1-score, 99.81% and 0.9978 respectively, which is better than the previous approaches and demonstrates the novelty of the work. Therefore, the method proposed in this research can be applied to other agricultural applications for transparent detection and classification purposes.

