MULTILABEL OVER-SAMPLING AND UNDER-SAMPLING WITH CLASS ALIGNMENT FOR IMBALANCED MULTILABEL TEXT CLASSIFICATION

Multilabel text classification, the simultaneous assignment of multiple labels to a document, does not perform optimally when the classes are highly imbalanced. Class imbalance entails skewness in the underlying data distribution, which makes classification more difficult. Random over-sampling and under-sampling are common approaches to the class imbalance problem. However, these approaches have drawbacks: under-sampling is likely to discard useful data, whereas over-sampling can increase the probability of overfitting. Therefore, a new method that avoids both discarding useful data and overfitting is needed. This study proposes a method to tackle the class imbalance problem by combining multilabel over-sampling and under-sampling with class alignment (ML-OUSCA). Instead of using all the training instances, the proposed ML-OUSCA draws a new training set by over-sampling small classes and under-sampling large classes. ML-OUSCA was evaluated using average precision, average recall and average F-measure on three benchmark datasets, namely Reuters-21578, Bibtex, and Enron. Experimental results showed that the proposed ML-OUSCA outperformed the chosen baseline random resampling approaches, K-means SMOTE and KNN-US. Based on these results, we conclude that a resampling method designed around class imbalance together with class alignment improves multilabel classification more than random resampling alone.
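
The general idea of drawing a new training set by over-sampling small classes and under-sampling large ones can be sketched as follows. This is only a minimal illustration of plain per-label random resampling; it is not ML-OUSCA and it omits the class alignment step, whose details the abstract does not give. The function name `resample_multilabel` and the `target` parameter are hypothetical.

```python
import random
from collections import defaultdict


def resample_multilabel(samples, target, seed=0):
    """Draw `target` instances per label from (doc_id, labels) pairs.

    Labels with fewer than `target` instances are over-sampled (duplicates
    drawn with replacement); larger labels are under-sampled (a random
    subset drawn without replacement). Illustrative assumption only.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for doc_id, labels in samples:
        for label in labels:
            by_label[label].append((doc_id, labels))

    resampled = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < target:
            # Over-sample: keep every instance, then add random duplicates.
            drawn = pool + rng.choices(pool, k=target - len(pool))
        else:
            # Under-sample: keep a random subset of the required size.
            drawn = rng.sample(pool, target)
        resampled.extend(drawn)
    return resampled


samples = [
    ("d1", frozenset({"a"})),
    ("d2", frozenset({"a"})),
    ("d3", frozenset({"a", "b"})),
    ("d4", frozenset({"a"})),
    ("d5", frozenset({"b"})),
]
balanced = resample_multilabel(samples, target=3)
```

Because a document can carry several labels, drawing it for one label also changes the counts of its other labels, so the per-label totals after resampling are only approximately balanced. That side effect is one reason multilabel resampling needs more care than the single-label case.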

An Empirical Investigation on a Multiple Filters-Based Approach for Remaining Useful Life Prediction

Machines ◽

10.3390/machines6030035 ◽

2018 ◽

Vol 6 (3) ◽

pp. 35 ◽

Cited By ~ 1

Author(s):

Hung-Cuong Trinh ◽

Yung-Keun Kwon

Keyword(s):

Scoring Function ◽

Principal Component ◽

Remaining Useful Life ◽

Filter Method ◽

Feature Construction ◽

Training Set ◽

Robust Solution ◽

Benchmark Datasets ◽

Useful Life ◽

Better Than

Feature construction is critical in data-driven remaining useful life (RUL) prediction of machinery systems, and most previous studies have attempted to find a best single-filter method. However, there is no best single filter that is appropriate for all machinery systems. In this work, we devise a straightforward but efficient approach for RUL prediction by combining multiple filters and then reducing the dimension through principal component analysis. We apply multilayer perceptron and random forest methods to learn the underlying model. We compare our approach with traditional single-filtering approaches using two benchmark datasets. The former approach is significantly better than the latter in terms of a scoring function with a penalty for late prediction. In particular, we note that selecting a best single filter over the training set is not efficient because of overfitting. Taken together, we validate that our multiple filters-based approach can be a robust solution for RUL prediction of various machinery systems.
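
A scoring function that penalizes late predictions more heavily than early ones can be sketched as below. The abstract does not state which function was used; this version is modelled on the asymmetric exponential score of the PHM 2008 data challenge, and the name `late_penalty_score` and the scale constants 13 and 10 are assumptions here.

```python
import math


def late_penalty_score(true_rul, predicted_rul, early_scale=13.0, late_scale=10.0):
    """Sum an asymmetric exponential penalty over all units (lower is better).

    error > 0 means the predicted RUL is longer than the true RUL, i.e. the
    failure would be anticipated too late; that side uses the smaller scale
    and therefore grows faster.
    """
    total = 0.0
    for actual, predicted in zip(true_rul, predicted_rul):
        error = predicted - actual
        scale = late_scale if error >= 0 else early_scale
        total += math.exp(abs(error) / scale) - 1.0
    return total


perfect = late_penalty_score([50], [50])
late = late_penalty_score([50], [60])   # predicts 10 cycles too many
early = late_penalty_score([50], [40])  # predicts 10 cycles too few
```

With equal absolute error, the late prediction receives the larger penalty, which reflects the higher cost of an unanticipated failure.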

Unsupervised Outlier Detection in Multidimensional Data

10.21203/rs.3.rs-250665/v1 ◽

2021 ◽

Author(s):

Atiq Rehman ◽

Samir Brahim Belhaouari

Keyword(s):

State Of The Art ◽

Machine Learning Algorithms ◽

Multidimensional Data ◽

High Dimensions ◽

Comprehensive Performance ◽

Benchmark Datasets ◽

Distance Vector ◽

Detection Schemes ◽

Unsupervised Outlier Detection ◽

Better Than

Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found to be efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two proposed techniques presented in this paper use only a single dimensional distance vector to detect the outliers, so irrespective of the data’s high dimensions, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found to perform better than the state-of-the-art methods when tested on several benchmark datasets.
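
The idea of reducing multidimensional data to a single distance vector before looking for outliers can be illustrated as follows. This is a generic sketch and not one of the paper's techniques: it assumes distances to the coordinate-wise median and a median absolute deviation (MAD) threshold, and the names `distance_vector`, `flag_outliers` and the cutoff `k` are hypothetical.

```python
import math
import statistics


def distance_vector(points):
    """Collapse d-dimensional points into one distance per point."""
    dims = len(points[0])
    # The coordinate-wise median is less affected by outliers than the mean.
    center = [statistics.median(p[i] for p in points) for i in range(dims)]
    return [math.dist(p, center) for p in points]


def flag_outliers(points, k=3.0):
    """Flag points whose distance exceeds the median distance by k MADs."""
    distances = distance_vector(points)
    med = statistics.median(distances)
    mad = statistics.median(abs(d - med) for d in distances)
    if mad == 0:
        return [False] * len(distances)
    return [(d - med) / mad > k for d in distances]


points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
flags = flag_outliers(points)
```

After the distances are computed, the thresholding step works on a one-dimensional vector whatever the original dimensionality, which is what keeps this kind of approach inexpensive.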

Explicit Interaction Model towards Text Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016359 ◽

2019 ◽

Vol 33 ◽

pp. 6359-6366 ◽

Cited By ~ 3

Author(s):

Cunxiao Du ◽

Zhaozheng Chen ◽

Fuli Feng ◽

Lei Zhu ◽

Tian Gan ◽

...

Keyword(s):

Language Processing ◽

Text Classification ◽

Deep Neural Networks ◽

Interaction Mechanism ◽

Interaction Model ◽

Classification Task ◽

Fine Grained ◽

Word Level ◽

Benchmark Datasets ◽

Classification Tasks

Text classification is one of the fundamental tasks in natural language processing. Recently, deep neural networks have achieved promising performance in the text classification task compared to shallow models. Despite the significance of deep models, they ignore fine-grained classification clues (matching signals between words and classes), since their classifications mainly rely on text-level representations. To address this problem, we introduce the interaction mechanism to incorporate word-level matching signals into the text classification task. In particular, we design a novel framework, EXplicit interAction Model (dubbed EXAM), equipped with the interaction mechanism. We validated the proposed approach on several benchmark datasets including both multilabel and multi-class text classification tasks. Extensive experimental results demonstrate the superiority of the proposed method. As a byproduct, we have released the code and parameter settings to facilitate further research.

Ant Miner

International Journal of Artificial Intelligence and Machine Learning ◽

10.4018/ijaiml.2020010104 ◽

2020 ◽

Vol 10 (1) ◽

pp. 45-59

Author(s):

Bijaya Kumar Nanda ◽

Satchidananda Dehuri

Keyword(s):

Data Mining ◽

Large Data ◽

Classification Rule ◽

Classification Rules ◽

Rule Mining ◽

Ant Colonies ◽

Benchmark Datasets ◽

Objective Classification ◽

Single Objective ◽

Better Than

In data mining, extracting classification rules from large data is an important task that is gaining considerable attention. This article presents a novel ant miner for classification rule mining. The ant miner is inspired by research on the behaviour of real ant colonies, simulated annealing, and some data mining concepts and principles. This paper presents a Pittsburgh style approach for single objective classification rule mining. The algorithm is tested on a few benchmark datasets drawn from the UCI repository. The experimental outcomes confirm that ant miner-HPB (Hybrid Pittsburgh Style Classification) is significantly better than ant-miner-PB (Pittsburgh Style Classification).

Fusing Logical Relationship Information of Text in Neural Network for Text Classification

Mathematical Problems in Engineering ◽

10.1155/2020/5426795 ◽

2020 ◽

Vol 2020 ◽

pp. 1-16 ◽

Cited By ~ 1

Author(s):

Heyong Wang ◽

Dehang Zeng

Keyword(s):

Neural Network ◽

Text Classification ◽

Information Science ◽

Classification Algorithms ◽

Human Beings ◽

Central Idea ◽

Logical Relationship ◽

The Relationship ◽

Different Parts ◽

Better Than

With the development of computer science and information science, text classification technology has advanced greatly and its application scenarios have widened. In the traditional text classification process, existing methods lose much of the logical relationship information of the text. The logical relationship information of a text refers to the relationship information among different logical parts of the text, such as title, abstract, and body. When human beings read, they take the title as an important cue to the central idea of the article, the abstract as a brief summary of its content, and the body as a detailed description. In most text classification studies, researchers are more concerned with the relationship among words (word frequency, semantics, etc.) and neglect the logical relationship information of the text. This loses information about the relationship among different parts (title, body, etc.) and affects the performance of text classification. Therefore, we propose a text classification algorithm, fusing the logical relationship information of text in neural network (FLRIOTINN), which incorporates the logical relationship information into text classification algorithms. Experiments show that FLRIOTINN performs better than conventional backpropagation neural networks, which do not consider the logical relationship information of text.

Sentiment Classification Using Convolutional Neural Networks

Applied Sciences ◽

10.3390/app9112347 ◽

2019 ◽

Vol 9 (11) ◽

pp. 2347 ◽

Cited By ~ 18

Author(s):

Hannah Kim ◽

Young-Seob Jeong

Keyword(s):

Neural Network ◽

Neural Networks ◽

Convolutional Neural Networks ◽

Text Classification ◽

State Of The Art ◽

Sentiment Classification ◽

Learning Models ◽

Text Data ◽

Textual Data ◽

Better Than

As the amount of textual data is increasing exponentially, it becomes more important to develop models to analyze the text data automatically. The texts may carry various labels such as gender, age, country, sentiment, and so forth. Using such labels may bring benefits to some industrial fields, so many studies of text classification have appeared. Recently, the Convolutional Neural Network (CNN) has been adopted for the task of text classification and has shown quite successful results. In this paper, we propose convolutional neural networks for the task of sentiment classification. Through experiments with three well-known datasets, we show that employing consecutive convolutional layers is effective for relatively longer texts, and our networks are better than other state-of-the-art deep learning models.

QTG-Finder2: A Generalized Machine-Learning Algorithm for Prioritizing QTL Causal Genes in Plants

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401122 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2411-2421

Author(s):

Fan Lin ◽

Elena Z. Lazarus ◽

Seung Y. Rhee

Keyword(s):

Machine Learning ◽

Linkage Mapping ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Causal Gene ◽

Training Set ◽

Average Precision ◽

Trait Improvement ◽

Causal Genes ◽

Mapping Process

Linkage mapping has been widely used to identify quantitative trait loci (QTL) in many plants and usually requires a time-consuming and labor-intensive fine mapping process to find the causal gene underlying the QTL. Previously, we described QTG-Finder, a machine-learning algorithm to rationally prioritize candidate causal genes in QTLs. While it showed good performance, QTG-Finder could only be used in Arabidopsis and rice because of the limited number of known causal genes in other species. Here we tested the feasibility of enabling QTG-Finder to work on species that have few or no known causal genes by using orthologs of known causal genes as the training set. The model trained with orthologs could recall about 64% of Arabidopsis and 83% of rice causal genes when the top 20% ranked genes were considered, which is similar to the performance of models trained with known causal genes. The average precision was 0.027 for Arabidopsis and 0.029 for rice. We further extended the algorithm to include polymorphisms in conserved non-coding sequences and gene presence/absence variation as additional features. Using this algorithm, QTG-Finder2, we trained and cross-validated Sorghum bicolor and Setaria viridis models. The S. bicolor model was validated by causal genes curated from the literature and could recall 70% of causal genes when the top 20% ranked genes were considered. In addition, we applied the S. viridis model and public transcriptome data to prioritize a plant height QTL and identified 13 candidate genes. QTG-Finder2 can accelerate the discovery of causal genes in any plant species and facilitate agricultural trait improvement.
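
The "recall when the top 20% ranked genes were considered" figures can be read as the share of known causal genes that land in the top fifth of the model's ranking. A minimal sketch of that calculation, with hypothetical gene identifiers and scores and an assumed helper name `recall_at_top_fraction`:

```python
def recall_at_top_fraction(scores, causal_genes, fraction=0.2):
    """Fraction of known causal genes found among the top-ranked genes.

    scores: dict mapping gene id -> model score (higher = more likely causal).
    causal_genes: iterable of gene ids known to be causal.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = max(1, round(fraction * len(ranked)))
    top = set(ranked[:cutoff])
    causal = set(causal_genes) & set(scores)
    if not causal:
        raise ValueError("no known causal genes among the scored genes")
    return len(top & causal) / len(causal)


# Ten hypothetical genes with descending scores; g2 and g7 are causal.
scores = {f"g{i}": 1.0 - 0.1 * i for i in range(1, 11)}
recall = recall_at_top_fraction(scores, {"g2", "g7"})
```

Here the top 20% holds two genes (g1 and g2), so one of the two causal genes is recovered. A low average precision alongside a high recall is expected in this setting, because causal genes are a very small share of all genes in a QTL.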

Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement

Information ◽

10.3390/info11060332 ◽

2020 ◽

Vol 11 (6) ◽

pp. 332

Author(s):

Ernest Kwame Ampomah ◽

Zhiguang Qin ◽

Gabriel Nyame

Keyword(s):

Machine Learning ◽

Stock Market ◽

Stock Price ◽

Superior Performance ◽

Operating Characteristics ◽

Training Set ◽

Data Set ◽

Test Set ◽

Ensemble Machine Learning ◽

Better Than

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries considerable risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models. Also, ensemble ML models have been shown in the literature to perform better than single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight different stock datasets from three stock exchanges (NYSE, NASDAQ, and NSE) are randomly collected and used for the study. Each data set is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). The Kendall W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics produced results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
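
Most of the test-set metrics listed above follow directly from the binary confusion matrix for the up/down prediction. A small sketch with made-up counts (the helper name `binary_metrics` is hypothetical; AUC-ROC is left out because it needs ranked scores rather than counts):

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Illustrative counts only: 100 test days, "up" is the positive class.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

The sketch assumes every denominator is non-zero; a model that never predicts the positive class would need a guard around the precision and F1 calculations.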

DIALOG ACT CLASSIFICATION USING ACOUSTIC AND DISCOURSE INFORMATION OF MAPTASK DATA

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026810002926 ◽

2010 ◽

Vol 09 (04) ◽

pp. 289-311 ◽

Cited By ~ 2

Author(s):

FATEMA N. JULIA ◽

KHAN M. IFTEKHARUDDIN ◽

ATIQ U. ISLAM

Keyword(s):

Classifier Fusion ◽

Support Vector ◽

Acoustic Features ◽

Average Precision ◽

Data Set ◽

Parts Of Speech ◽

Pos Tagging ◽

Accuracy Rates ◽

Better Than

Dialog act (DA) classification is useful to understand the intentions of a human speaker. An effective classification of DA can be exploited for realistic implementation of expert systems. In this work, we investigate DA classification using both acoustic and discourse information for HCRC MapTask data. We extract several different acoustic features and exploit these features using a Hidden Markov Model (HMM) network to classify acoustic information. For discourse feature extraction, we propose a novel parts-of-speech (POS) tagging technique that effectively reduces the dimensionality of discourse features. To classify discourse information, we exploit two classifiers, an HMM and a Support Vector Machine (SVM). We further fuse the HMM and SVM classifiers to improve discourse classification. Finally, we perform an efficient decision-level classifier fusion for both acoustic and discourse information to classify 12 different DAs in MapTask data. We obtain 65.2% and 55.4% DA classification rates using acoustic and discourse information, respectively. Furthermore, we obtain a combined accuracy of 68.6% for DA classification using both acoustic and discourse information. These accuracy rates of DA classification are either comparable to or better than previously reported results for the same data set. We obtain average precision and recall of 74.89% and 69.83%, respectively. Therefore, we obtain much better precision and recall rates for most of the classified DAs when compared to existing works on the same HCRC MapTask data set.
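
Decision-level fusion combines the outputs of separately trained classifiers instead of their input features. The abstract does not specify the fusion rule, so the sketch below assumes a simple weighted average of per-class scores. The function `fuse_decisions`, the weights, and the dialog act labels are hypothetical.

```python
def fuse_decisions(acoustic_scores, discourse_scores, acoustic_weight=0.5):
    """Fuse two classifiers' per-class scores and return the winning class.

    Each argument maps a dialog act label to that classifier's score
    (for example a posterior probability) for one utterance.
    """
    discourse_weight = 1.0 - acoustic_weight
    fused = {
        label: acoustic_weight * acoustic_scores[label]
        + discourse_weight * discourse_scores[label]
        for label in acoustic_scores
    }
    best = max(fused, key=fused.get)
    return best, fused


acoustic = {"acknowledge": 0.6, "query": 0.4}
discourse = {"acknowledge": 0.3, "query": 0.7}
label, fused = fuse_decisions(acoustic, discourse)
```

In this example the two classifiers disagree, and the fused score favours "query" because the discourse classifier is more confident than the acoustic one. Raising `acoustic_weight` would shift the decision toward the acoustic classifier.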

A Weakly Supervised and Deep Learning Method for an Additive Topic Analysis of Large Corpora

10.31235/osf.io/nfr3p ◽

2019 ◽

Cited By ~ 2

Author(s):

Yair Fogel-Dror ◽

Shaul R. Shenhav ◽

Tamir Sheafer

Keyword(s):

Content Analysis ◽

Deep Learning ◽

Text Classification ◽

Large Scale ◽

Low Cost ◽

Initial Number ◽

Analysis Method ◽

Training Set ◽

Topic Analysis ◽

Weakly Supervised

The collaborative effort of theory-driven content analysis can benefit significantly from the use of topic analysis methods, which allow researchers to add more categories while developing or testing a theory. This additive approach enables the reuse of previous efforts of analysis or even the merging of separate research projects, thereby making these methods more accessible and increasing the discipline’s ability to create and share content analysis capabilities. This paper proposes a weakly supervised topic analysis method that uses both a low-cost unsupervised method to compile a training set and supervised deep learning as an additive and accurate text classification method. We test the validity of the method, specifically its additivity, by comparing the results of the method after adding 200 categories to an initial number of 450. We show that the suggested method provides a foundation for a low-cost solution for large-scale topic analysis.
