Application on small sample classification of ER rule classifier

Abstract With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity into the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that on a significantly larger simulated representative training set that the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.

Download Full-text

An Ensemble Classification Method for High-Dimensional Data Using Neighborhood Rough Set

Complexity ◽

10.1155/2021/8358921 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Jing Zhang ◽

Guang Lu ◽

Jiaquan Li ◽

Chuanwen Li

Keyword(s):

Feature Selection ◽

Rough Set ◽

Small Sample Size ◽

High Dimensional Data ◽

Classification Performance ◽

Small Sample ◽

Ensemble Classification ◽

High Dimensional ◽

Sample Classification ◽

Neighborhood Rough Set

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks due to high dimensionality and small sample size of microarray data. Feature selection is necessary in the process of constructing the model to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough set is proposed. Pathway knowledge is used to select feature subsets, and rough set based on intersection neighborhood is then used to select important feature in each subset, since it can select features without redundancy and deals with numerical features directly. In order to improve the diversity among base classifiers and the efficiency of classification, it is necessary to select part of base classifiers. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster will be selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.

Download Full-text

COMBINING GENERALIZED NMF AND DISCRIMINATIVE MIXTURE MODELS FOR CLASSIFICATION OF GENE EXPRESSION DATA

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001408006892 ◽

2008 ◽

Vol 22 (08) ◽

pp. 1587-1598 ◽

Cited By ~ 3

Author(s):

WEIXIANG LIU ◽

KEHONG YUAN ◽

JIAN WU ◽

DATIAN YE ◽

ZHEN JI ◽

...

Keyword(s):

Gene Expression ◽

Mixture Model ◽

Gene Expression Data ◽

Small Sample Size ◽

Data Classification ◽

Small Sample ◽

Training Data ◽

Microarray Data Analysis ◽

Expression Data

Classification of gene expression samples is a core task in microarray data analysis. How to reduce thousands of genes and to select a suitable classifier are two key issues for gene expression data classification. This paper introduces a framework on combining both feature extraction and classifier simultaneously. Considering the non-negativity, high dimensionality and small sample size, we apply a discriminative mixture model which is designed for non-negative gene express data classification via non-negative matrix factorization (NMF) for dimension reduction. In order to enhance the sparseness of training data for fast learning of the mixture model, a generalized NMF is also adopted. Experimental results on several real gene expression datasets show that the classification accuracy, stability and decision quality can be significantly improved by using the generalized method, and the proposed method can give better performance than some previous reported results on the same datasets.

Download Full-text

Analysing the features of negative sentiment tweets

The Electronic Library ◽

10.1108/el-05-2017-0120 ◽

2018 ◽

Vol 36 (5) ◽

pp. 782-799 ◽

Cited By ~ 4

Author(s):

Ling Zhang ◽

Wei Dong ◽

Xiangming Mu

Keyword(s):

Research Area ◽

Small Sample ◽

Negative Word ◽

Social Media Data ◽

Content Type ◽

Word Use ◽

Part Of Speech ◽

Negative Sentiment ◽

Media Data

Purpose This paper aims to address the challenge of analysing the features of negative sentiment tweets. The method adopted in this paper elucidates the classification of social network documents and paves the way for sentiment analysis of tweets in further research. Design/methodology/approach This study classifies negative tweets and analyses their features. Findings Through negative tweet content analysis, tweets are divided into ten topics. Many related words and negative words were found. Some indicators of negative word use could reflect the degree to which users release negative emotions: part of speech, the density and frequency of negative words and negative word distribution. Furthermore, the distribution of negative words obeys Zipf’s law. Research limitations/implications This study manually analysed only a small sample of negative tweets. Practical implications The research explored how many categories of negative sentiment tweets there are on Twitter. Related words are helpful to construct an ontology of tweets, which helps people with information retrieval in a fixed research area. The analysis of extracted negative words determined the features of negative tweets, which is useful to detect the polarity of tweets by machine learning method. Originality/value The research provides an initial exploration of a negative document classification method and classifies the negative tweets into ten topics. By analysing the features of negative tweets, related words, negative words, the density of negative words, etc. are presented. This work is the first step to extend Plutchik’s emotion wheel theory into social media data analysis by constructing filed specific thesauri, referred to as local sentimental thesauri.

Download Full-text