Semi-Supervised Classification and its Application to Filtering IDS False Positives

Generally speaking, classification is the action of assigning an object to a category according to the characteristics of the object. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set. Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, distinguished from unsupervised learning. In supervised learning, the training data consists of pairs of input data (typically vectors), and desired outputs, while in unsupervised learning there is no a priori output. Classification has various applications, such as learning from a patient database to diagnose a disease based on the symptoms of a patient, analyzing credit card transactions to identify fraudulent transactions, automatic recognition of letters or digits based on handwriting samples, and distinguishing highly active compounds from inactive ones based on the structures of compounds for drug discovery.

Semi-Supervised Learning

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch192 ◽

2011 ◽

pp. 1022-1027

Author(s):

Tobias Scheffer

Keyword(s):

Supervised Learning ◽

Supervised Classification ◽

Unlabeled Data ◽

Training Data ◽

Classification Algorithms ◽

Classification Problems

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data.

A novel image classification model based on adversarial training for pulsar candidate identification

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-200925 ◽

2020 ◽

Vol 39 (5) ◽

pp. 7657-7669

Author(s):

Linyong Zhou ◽

Shanping You ◽

Bimo Ren ◽

Xuhong Yu ◽

Xiaoyao Xie

Keyword(s):

Image Recognition ◽

Optimal Solution ◽

Poor Performance ◽

Classification Performance ◽

Training Data ◽

Classification Model ◽

Model Based ◽

Unseen Data ◽

Adversarial Training ◽

Candidate Identification

Pulsars are highly magnetized, rotating neutron stars with small volume and high density. The discovery of pulsars is of great significance in the fields of physics and astronomy. With the development of artificial intelligence, image recognition models based on deep learning are increasingly used for pulsar candidate identification. However, pulsar candidate datasets are imbalanced and contain few positive samples, which causes traditional methods to perform poorly and to produce biased models. To address this, a general image recognition model based on adversarial training is proposed. The model comprises a generator, a classifier, and two discriminators. Theoretical analysis demonstrates that the model has a unique optimal solution and that the classifier is the inference network of the generator. Therefore, the samples produced by the generator significantly augment the diversity of the training data. When the model reaches equilibrium, it can not only predict labels for unseen data but also generate controllable samples. In experiments, part of the MNIST data is used for training. The results show that the model achieves better classification performance than a CNN and better controllability than CGAN and ACGAN. The model is then applied to the pulsar candidate datasets HTRU and FAST. Compared with the CNN model, the F-score increases by 1.99% and 3.67%, and the recall by 6.28% and 8.59%, respectively.

Research on the Entity Relation Extraction of Field Based on Semi-Supervised

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.225-226.1292 ◽

2011 ◽

Vol 225-226 ◽

pp. 1292-1300

Author(s):

Jian Yi Guo ◽

Jun Zhao ◽

Zheng Tao Yu ◽

Lei Su ◽

Yan Tuan Xian ◽

...

Keyword(s):

Supervised Learning ◽

Information Entropy ◽

Rate Increase ◽

Relation Extraction ◽

Classification Performance ◽

Training Data ◽

Small Scale ◽

Result Show ◽

Entity Relation Extraction ◽

Precision Rate

To address the problem that supervised learning needs large amounts of labeled training data, this paper proposes a new method based on semi-supervised learning. First, small-scale training data are used to construct a classifier of a certain accuracy. Second, following a self-expanding idea, information entropy is applied to select new instances of higher credibility from the candidate instances predicted by the classifier. Finally, the classifier is retrained iteratively on the expanded training data until classification performance stabilizes, at which point iteration terminates; this achieves entity relation extraction in the tourism field by semi-supervised learning. The experimental results show that the new classifier, which applies information entropy to iteratively expand its training data, increases the precision rate by 7% and the F-score by 15%.
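
The entropy-based selection step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the entropy criterion is thresholded, so the `max_entropy` cutoff, the function names, and the relation labels are all assumptions made for the example.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_confident(candidates, max_entropy):
    """Keep candidates whose predicted label distribution has low entropy.

    candidates: list of (instance, {label: probability}) pairs, as produced
    by some classifier (hypothetical here).
    Returns (instance, predicted_label) pairs to add to the training data.
    """
    selected = []
    for instance, dist in candidates:
        if entropy(dist.values()) <= max_entropy:
            label = max(dist, key=dist.get)  # most probable label
            selected.append((instance, label))
    return selected

# Hypothetical classifier outputs for two candidate sentences.
candidates = [
    ("s1", {"located_in": 0.95, "other": 0.05}),  # confident prediction
    ("s2", {"located_in": 0.55, "other": 0.45}),  # near-uniform prediction
]
picked = select_confident(candidates, max_entropy=0.5)
```

Low entropy means the classifier concentrates its probability on one label, so only `s1` would be added to the training data in this example; `s2` stays in the candidate pool for a later iteration.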

Developing Sustainable Classification of Diseases via Deep Learning and Semi-Supervised Learning

Healthcare ◽

10.3390/healthcare8030291 ◽

2020 ◽

Vol 8 (3) ◽

pp. 291 ◽

Cited By ~ 1

Author(s):

Chunwu Yin ◽

Zhanbo Chen

Keyword(s):

Deep Learning ◽

Supervised Learning ◽

Learning Style ◽

Classification Performance ◽

Disease Classification ◽

Training Data ◽

Training Procedure ◽

Training Approach ◽

Classification Of Diseases ◽

Deep Forest

Disease classification based on machine learning has become a crucial research topic in the fields of genetics and molecular biology. Generally, disease classification involves a supervised learning style; i.e., it requires a large number of labelled samples to achieve good classification performance. However, in the majority of cases, labelled samples are hard to obtain, so the amount of training data is limited. At the same time, many unclassified (unlabelled) sequences have been deposited in public databases, which may help the training procedure. Using such unlabelled data alongside labelled data is called semi-supervised learning and is very useful in many applications. Self-training can be implemented using high- to low-confidence samples to prevent noisy samples from affecting the robustness of semi-supervised learning in the training process. The deep forest method with the hyperparameter settings used in this paper can achieve excellent performance. Therefore, in this work, we propose a novel model that combines deep learning and semi-supervised learning with a self-training approach to improve the performance in disease classification; it utilizes unlabelled samples to update a mechanism designed to increase the number of high-confidence pseudo-labelled samples. The experimental results show that our proposed model can achieve good performance in disease classification and disease-causing gene identification.
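
The high-to-low-confidence self-training idea can be illustrated with a small sketch. It makes several assumptions that are not from the paper: a one-dimensional nearest-centroid classifier stands in for the deep forest, the confidence score and the threshold schedule `(0.9, 0.8, 0.7)` are invented for the example, and all names are hypothetical.

```python
def fit_centroids(labeled):
    """Compute the mean feature value per class from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_confidence(centroids, x):
    """Return (label, confidence) for the nearest centroid.

    Confidence is 1 minus the nearest distance's share of all distances,
    an ad hoc score chosen only for this illustration.
    """
    dists = {y: abs(x - c) for y, c in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, conf

def self_train(labeled, unlabeled, thresholds=(0.9, 0.8, 0.7)):
    """Pseudo-label unlabeled samples, lowering the confidence bar each round."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for t in thresholds:  # high to low confidence
        centroids = fit_centroids(labeled)  # refit on the data gathered so far
        remaining = []
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            if conf >= t:
                labeled.append((x, label))  # accept the pseudo-label
            else:
                remaining.append(x)  # keep for a later, looser round
        pool = remaining
    return fit_centroids(labeled), labeled, pool

model, final_labeled, leftover = self_train(
    labeled=[(0.0, "healthy"), (10.0, "disease")],
    unlabeled=[0.5, 9.5, 2.5, 5.0],
)
```

In this run the samples closest to a centroid are pseudo-labelled first, the sample at 2.5 is only accepted once the threshold drops to 0.7, and the ambiguous sample at 5.0 is never labelled, which is the noise-limiting behaviour the abstract describes.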

Assessments of Feature Selection Techniques with Respect to Data Sampling for Highly Imbalanced Software Measurement Data

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539315500102 ◽

2015 ◽

Vol 22 (02) ◽

pp. 1550010 ◽

Cited By ~ 1

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar

Keyword(s):

Feature Selection ◽

Measurement Data ◽

Classification Performance ◽

Training Data ◽

Classification Model ◽

Sampling Techniques ◽

Data Sampling ◽

Software Measurement ◽

Data Set ◽

The Stability

In the process of software defect prediction, a classification model is first built using software metrics and fault data gathered from a past software development project; that model is then applied to data from a similar project or a new release of the same project to predict new program modules as either fault-prone (fp) or not-fault-prone (nfp). The benefit of such a model is to facilitate the optimal use of limited financial and human resources for software testing and inspection. The predictive power of a classification model constructed from a given data set is affected by many factors. In this paper, we focus on two problems that often arise in software measurement data: high dimensionality and unequal numbers of examples for the two types of modules (e.g., many more nfp modules than fp modules in a data set). These directly result in longer learning times and a decline in the predictive performance of classification models. We consider using data sampling followed by feature selection (FS) to deal with these problems. Six data sampling strategies (made up of three sampling techniques, each with two post-sampling proportion ratios) and six commonly used feature ranking approaches are employed in this study. We evaluate the FS techniques by means of: (1) a general method, i.e., assessing the classification performance after the training data is modified, and (2) studying the stability of an FS method, specifically with the goal of understanding the effect of data sampling techniques on the stability of FS when using the sampled data. The experiments were performed on nine data sets from a real-world software project. The results demonstrate that the FS techniques that most enhance the models' classification performance do not also show the best stability, and vice versa. In addition, classification performance is affected more by the sampling techniques themselves than by the post-sampling proportions, whereas the opposite holds for stability.
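
The "data sampling followed by feature selection" pipeline can be sketched as below. The abstract does not name the three sampling techniques or the six rankers, so random undersampling and a mean-difference ranking score are stand-ins chosen for illustration; the function names, the `ratio` parameter, and the toy metrics are hypothetical.

```python
import random

def undersample(rows, labels, majority, ratio, seed=0):
    """Randomly drop majority-class rows until majority:minority is at most ratio:1."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    chosen = sorted(minority_idx + rng.sample(majority_idx, keep))
    return [rows[i] for i in chosen], [labels[i] for i in chosen]

def rank_features(rows, labels, positive):
    """Rank feature indices by absolute difference between class means, largest first.

    Assumes both classes are present in the sampled data.
    """
    pos = [r for r, y in zip(rows, labels) if y == positive]
    neg = [r for r, y in zip(rows, labels) if y != positive]

    def mean(group, j):
        return sum(r[j] for r in group) / len(group)

    scores = {j: abs(mean(pos, j) - mean(neg, j)) for j in range(len(rows[0]))}
    return sorted(scores, key=scores.get, reverse=True)

# Toy software metrics: feature 0 is uninformative, feature 1 separates fp from nfp.
rows = [[5, 1], [5, 2], [5, 1], [5, 2], [5, 1], [5, 2], [5, 9], [5, 8]]
labels = ["nfp"] * 6 + ["fp"] * 2

s_rows, s_labels = undersample(rows, labels, majority="nfp", ratio=1)
ranking = rank_features(s_rows, s_labels, positive="fp")
```

Here `ratio=1` plays the role of a post-sampling proportion (a balanced 1:1 class mix). Stability in the paper's sense could then be probed by repeating the sampling with different seeds and comparing the resulting rankings.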

Classification Methods

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch028 ◽

2011 ◽

pp. 144-149 ◽

Cited By ~ 1

Author(s):

Aijun An

Keyword(s):

Unsupervised Learning ◽

Supervised Learning ◽

A Priori ◽

Training Data ◽

Classification Model ◽

Data Set ◽

Data Object ◽

Unseen Data ◽

Data Objects ◽

Class Variable

Generally speaking, classification is the action of assigning an object to a category according to the characteristics of the object. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set. Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, distinguished from unsupervised learning. In supervised learning, the training data consists of pairs of input data (typically vectors), and desired outputs, while in unsupervised learning there is no a priori output.

Weakly Supervised Fine-Grained Image Classification via Salient Region Localization and Different Layer Feature Fusion

Applied Sciences ◽

10.3390/app10134652 ◽

2020 ◽

Vol 10 (13) ◽

pp. 4652

Author(s):

Fangxiong Chen ◽

Guoheng Huang ◽

Jiaying Lan ◽

Yanhui Wu ◽

Chi-Man Pun ◽

...

Keyword(s):

Image Classification ◽

Feature Fusion ◽

Classification Performance ◽

Training Data ◽

Classification Model ◽

Global Features ◽

Salient Region ◽

Fine Grained ◽

Proposed Model ◽

Weakly Supervised

The fine-grained image classification task is about differentiating between different object classes. The difficulties of the task are large intra-class variance and small inter-class variance. For this reason, improving models’ accuracy on the task relies heavily on annotations of discriminative parts and regional parts. This dependency on delicate annotations restricts the models’ practicability. To tackle this issue, this article proposes a saliency module based on a weakly supervised fine-grained image classification model. Through our salient region localization module, the proposed model can localize essential regional parts using saliency maps, while only image class annotations are provided. In addition, the bilinear attention module can improve feature extraction by using higher- and lower-level layers of the network to fuse regional features with global features. Applying the bilinear attention architecture, we propose the different layer feature fusion module to improve the expressive ability of the model features. We tested and verified our model on public datasets released specifically for fine-grained image classification. The test results show that our proposed model can achieve close to state-of-the-art classification performance on various datasets with only minimal training data. This result indicates that the practicality of our model is substantially improved, since fine-grained image datasets are expensive.

Neural labeled LDA: a topic model for semi-supervised document classification

10.21203/rs.3.rs-171202/v1 ◽

2021 ◽

Author(s):

Wei Wang ◽

Bing Guo ◽

Yan Shen ◽

Han Yang ◽

Yaosen Chen ◽

...

Keyword(s):

Supervised Learning ◽

Topic Modeling ◽

Supervised Classification ◽

Topic Model ◽

Classification Performance ◽

Document Classification ◽

Classification Problems ◽

Modeling Approaches ◽

Proposed Model ◽

Density Assumption

Recently, some statistical topic modeling approaches based on LDA have been applied in the field of supervised document classification, where the model generation procedure incorporates prior knowledge to improve the classification performance. However, these customizations of topic modeling are limited by the cumbersome derivation of a specific inference algorithm for each modification. In this paper, we propose a new supervised topic modeling approach for document classification problems, Neural Labeled LDA (NL-LDA), which builds on the VAE framework and designs a special generative network to incorporate prior information. The proposed model can support semi-supervised learning based on the manifold assumption and the low-density assumption. Meanwhile, NL-LDA has a consistent and concise inference method for both semi-supervised learning and prediction. Quantitative experimental results demonstrate that our model has outstanding performance on supervised document classification relative to the compared approaches, including traditional statistical and neural topic models. In particular, the proposed model can support both single-label and multi-label document classification. The proposed NL-LDA performs particularly well on semi-supervised classification, especially with a small amount of labeled data. Further comparisons with related works also indicate that our model is competitive with state-of-the-art topic modeling approaches on semi-supervised classification.

A Classification Model of Legal Consulting Questions Based on Multi-Attention Prototypical Networks

International Journal of Computational Intelligence Systems ◽

10.1007/s44196-021-00053-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Jianzhou Feng ◽

Jinman Cui ◽

Qikai Wei ◽

Zhengji Zhou ◽

Yuxiong Wang

Keyword(s):

Supervised Learning ◽

Language Processing ◽

Text Classification ◽

Question Answering ◽

Training Data ◽

Classification Model ◽

Great Progress ◽

Public Datasets ◽

The Cost

Text classification is a research hotspot in the field of natural language processing. Existing text classification models based on supervised learning, especially deep learning models, have made great progress on public datasets. But most of these methods rely on a large amount of training data, and the coverage of these datasets is limited. In a legal intelligent question-answering system, accurate classification of legal consulting questions is a necessary prerequisite for intelligent question answering. However, annotated data are insufficient and the cost of labeling is high, so traditional supervised learning methods perform poorly under sparse labeling. In response to these problems, we construct a few-shot legal consulting questions dataset and propose a prototypical networks model based on multi-attention. For instances of the same category, this model first highlights the key features in the instances as much as possible through instance-dimension level attention. It then classifies legal consulting questions with prototypical networks. Experimental results show that our model achieves state-of-the-art results compared with baseline models. The code and dataset are released on https://github.com/cjm0824/MAPN.
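
The classification step of a prototypical network can be sketched in a few lines: each class prototype is the mean of that class's support embeddings, and a query is assigned to the class whose prototype is nearest. The sketch below assumes the embeddings are already computed (the paper's multi-attention encoder is not reproduced), uses squared Euclidean distance, and invents the category names and vectors for illustration; it is not the code from the linked repository.

```python
def prototypes(support):
    """Average the support embeddings of each class into one prototype vector."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query embedding to the class with the nearest prototype."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))

# Hypothetical 2-D embeddings for a 2-way, 2-shot episode.
support = {
    "contract": [[1.0, 0.0], [0.8, 0.2]],
    "labor": [[0.0, 1.0], [0.2, 0.8]],
}
protos = prototypes(support)
prediction = classify([0.7, 0.1], protos)
```

Because a prototype needs only a handful of support examples per class, this scheme suits the few-shot setting the abstract targets.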
