Classification Methods

Author(s):  
Aijun An

Generally speaking, classification is the action of assigning an object to a category according to the characteristics of the object. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set. Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, distinguished from unsupervised learning. In supervised learning, the training data consists of pairs of input data (typically vectors), and desired outputs, while in unsupervised learning there is no a priori output. Classification has various applications, such as learning from a patient database to diagnose a disease based on the symptoms of a patient, analyzing credit card transactions to identify fraudulent transactions, automatic recognition of letters or digits based on handwriting samples, and distinguishing highly active compounds from inactive ones based on the structures of compounds for drug discovery.
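As a rough illustration of the terminology above, the following sketch learns a model from a small training data set and uses it to classify an unseen example. The nearest-centroid rule, the function names, and the toy data are assumptions chosen for brevity; the abstract does not prescribe any particular classification method.

```python
from collections import defaultdict


def train_centroids(examples):
    """Learn one centroid per class from (predictor attributes, class label) pairs."""
    sums = {}
    counts = defaultdict(int)
    for attributes, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(attributes)
        for i, value in enumerate(attributes):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify(model, attributes):
    """Assign an unseen example to the class whose centroid is closest."""
    return min(model, key=lambda label: squared_distance(model[label], attributes))


# Hypothetical training data set: (predictor attributes, class attribute).
training_data = [
    ((1.0, 1.2), "inactive"),
    ((0.8, 1.0), "inactive"),
    ((3.0, 3.1), "active"),
    ((3.2, 2.9), "active"),
]

model = train_centroids(training_data)
prediction = classify(model, (2.9, 3.0))
```

Here `training_data` plays the role of the pre-classified examples, the tuples of numbers are the predictor attributes, and the string is the class attribute.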


Author(s):  
J. L. ÁLVAREZ-MACÍAS ◽  
J. MATA-VÁZQUEZ ◽  
J. C. RIQUELME-SANTOS

In this paper we present a new method for applying data mining tools to the management phase of the software development process. Specifically, we describe two tools, the first based on supervised learning and the second on unsupervised learning. The goal of this method is to induce a set of management rules that make the development process easier for managers. Depending on how and to what this method is applied, it permits an a priori analysis, monitoring of the project, or a post-mortem analysis.


2013 ◽  
Vol 427-429 ◽  
pp. 2309-2312
Author(s):  
Hai Bin Mei ◽  
Ming Hua Zhang

Alert classifiers built with supervised classification techniques require large amounts of labeled training alerts. Preparing such training data is very difficult and expensive, so the accuracy and feasibility of current classifiers are greatly restricted. This paper employs semi-supervised learning to build an alert classification model that reduces the number of labeled training alerts needed. Alert context properties are also introduced to improve classification performance. Experiments have demonstrated the accuracy and feasibility of our approach.


Author(s):  
Yu Wang

The requirement imposed by supervised learning techniques for a labeled response variable in the training data may not be satisfied in some situations, particularly in dynamic, short-term, and ad-hoc wireless network access environments. Being able to conduct classification without a labeled response variable is an essential challenge in modern network security and intrusion detection. In this chapter we discuss some unsupervised learning techniques, including probability, similarity, and multidimensional models, that can be applied in network security. These methods also provide a different angle from which to analyze network traffic data. For comprehensive knowledge of unsupervised learning techniques, please refer to the machine learning references listed in the previous chapter; for their applications in network security, see Carmines, Edward & McIver (1981), Lane & Brodley (1997), Herrero, Corchado, Gastaldo, Leoncini, Picasso & Zunino (2007), and Dhanalakshmi & Babu (2008). Unlike in supervised learning, where for each vector $X = (x_1, x_2, \ldots, x_n)$ we have a corresponding observed response, Y, in unsupervised learning we have only X; Y is not available, either because we could not observe it or because its frequency is too low to be fitted with a supervised learning approach. Unsupervised learning is of great practical significance because in many circumstances the available network traffic data may not include any anomalous events or known anomalous events (e.g., traffic collected from a newly constructed network system). As high-speed mobile wireless and ad-hoc network systems have become popular, the need to develop new unsupervised learning methods that can model network traffic using anomaly-free training data has increased significantly.
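To make the idea of modeling traffic from anomaly-free training data concrete, the sketch below fits a per-feature baseline on unlabeled "normal" vectors X and scores new observations by their largest z-score. This is only one simple probability-style model; the feature names, the toy values, and the threshold of 3.0 are assumptions and are not taken from the chapter.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Estimate the mean and standard deviation of each feature from anomaly-free vectors."""
    columns = list(zip(*samples))
    return [(mean(column), stdev(column)) for column in columns]


def anomaly_score(baseline, x):
    """Return the largest absolute z-score of x across all features.

    Assumes every feature varies in the training data (non-zero standard deviation).
    """
    return max(abs(value - mu) / sigma for value, (mu, sigma) in zip(x, baseline))


# Hypothetical features: (packets per second, mean packet size in bytes).
normal_traffic = [(100, 500), (110, 520), (90, 480), (105, 510), (95, 490)]
baseline = fit_baseline(normal_traffic)

THRESHOLD = 3.0  # assumed cut-off; in practice this would be tuned

is_anomalous = anomaly_score(baseline, (400, 505)) > THRESHOLD
```

No response variable Y is used at any point: the baseline is estimated from X alone, and an observation is flagged only because it lies far from what the training traffic looked like.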


2021 ◽  
Vol 22 (2) ◽  
Author(s):  
Chiheb Eddine Ben Ncir

Overlapping clustering is an important challenge in unsupervised learning applications because it allows each data object to belong to more than one group. Several clustering methods have been proposed to meet this requirement by adapting conventional clustering approaches. Although these methods can detect non-disjoint partitions, they fail when the data contain groups with arbitrary, non-spherical shapes. We propose in this work a new density-based overlapping clustering method, referred to as OC-DD, which is able to detect overlapping clusters even when they have non-spherical and complex shapes. The proposed method uses density and distances to detect dense regions in the data while allowing some data objects to belong to more than one group. Experiments performed on artificial and real multi-labeled datasets have shown the effectiveness of the proposed method compared to existing ones.


2021 ◽  
Vol 11 (8) ◽  
pp. 3509
Author(s):  
Edgar Jacob Rivera Rios ◽  
Miguel Angel Medina-Pérez ◽  
Manuel S. Lazo-Cortés ◽  
Raúl Monroy

Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
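The two steps described above (learn a classifier for a target attribute, then turn its confusion matrix into a dissimilarity) can be sketched as follows. Both the stand-in classifier (a leave-one-out majority vote among rows that share the same remaining attributes) and the transformation (one minus the observed confusion rate between two values) are assumptions made for illustration; the abstract does not specify either, and the authors' actual choices may differ.

```python
from collections import Counter, defaultdict


def confusion_counts(rows, target):
    """Leave-one-out confusion counts for predicting column `target` from the other columns."""
    by_context = defaultdict(Counter)
    for row in rows:
        context = row[:target] + row[target + 1:]
        by_context[context][row[target]] += 1

    confusion = Counter()
    for row in rows:
        context = row[:target] + row[target + 1:]
        votes = by_context[context].copy()
        votes[row[target]] -= 1  # leave the current row out of its own vote
        votes = +votes  # drop values whose count fell to zero
        if not votes:
            continue  # no other row shares this context, so no prediction is made
        predicted = max(sorted(votes), key=votes.get)  # ties go to the alphabetically first value
        confusion[(row[target], predicted)] += 1
    return confusion


def dissimilarity(confusion, a, b):
    """Assumed transformation: 1 minus the rate at which values a and b are confused."""
    if a == b:
        return 0.0
    confused = confusion[(a, b)] + confusion[(b, a)]
    total = sum(n for (true, _), n in confusion.items() if true in (a, b))
    return 1.0 - confused / total if total else 1.0


# Hypothetical categorical dataset with columns (color, shape).
rows = [
    ("red", "round"),
    ("crimson", "round"),
    ("red", "round"),
    ("crimson", "round"),
    ("blue", "square"),
    ("blue", "square"),
    ("blue", "square"),
]

confusion = confusion_counts(rows, target=0)
d_red_crimson = dissimilarity(confusion, "red", "crimson")
d_red_blue = dissimilarity(confusion, "red", "blue")
```

In this toy dataset "red" and "crimson" always occur with the same shape, so the stand-in classifier confuses them every time and they come out as close, while "blue" is never confused with "red" and comes out as maximally distant.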


Author(s):  
Jianzhou Feng ◽  
Jinman Cui ◽  
Qikai Wei ◽  
Zhengji Zhou ◽  
Yuxiong Wang

Text classification is a research hotspot in the field of natural language processing. Existing text classification models based on supervised learning, especially deep learning models, have made great progress on public datasets. However, most of these methods rely on a large amount of training data, and the coverage of these datasets is limited. In a legal intelligent question-answering system, accurate classification of legal consulting questions is a necessary prerequisite for intelligent question answering. However, annotated data are insufficient and the cost of labeling is high, so traditional supervised learning methods perform poorly under sparse labeling. In response to these problems, we construct a few-shot legal consulting questions dataset and propose a prototypical networks model based on multi-attention. For instances of the same category, this model first highlights the key features in the instances as much as possible through instance-dimension level attention. It then classifies legal consulting questions using prototypical networks. Experimental results show that our model achieves state-of-the-art results compared with baseline models. The code and dataset are released on https://github.com/cjm0824/MAPN.


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4583 ◽  
Author(s):  
Xiaoqiang Liu ◽  
Yanming Chen ◽  
Shuyi Li ◽  
Liang Cheng ◽  
Manchun Li

Airborne laser scanning (ALS) can acquire both geometry and intensity information of geo-objects, which is important in mapping a large-scale three-dimensional (3D) urban environment. However, the intensity information recorded by ALS varies with flight height and atmospheric attenuation, which decreases the robustness of a trained supervised classifier. This paper proposes a hierarchical classification method that uses the geometry and intensity information of urban ALS data separately. The method uses supervised learning for the stable geometry information and unsupervised learning for the fluctuating intensity information. The experimental results show that the proposed method can utilize the intensity information effectively, in three respects. (1) The proposed method improves the accuracy of the classification result by using intensity. (2) When the ALS data to be classified are acquired under the same conditions as the training data, the performance of the proposed method is as good as that of the supervised learning method. (3) When the ALS data to be classified are acquired under different conditions from the training data, the performance of the proposed method is better than that of the supervised learning method. Therefore, the classification model derived from the proposed method can be transferred to other ALS data whose intensity is inconsistent with the training data. Furthermore, the proposed method can contribute to the hierarchical use of other ALS information, such as multi-spectral information.


2021 ◽  
Vol 7 (2) ◽  
Author(s):  
Beth Coleman

In addressing the issue of harmful bias in AI systems, this paper asks for a consideration of a generatively wild AI that exceeds the framework of predictive machine learning. The argument places supervised learning with its labeled training data as primarily a form of reproduction of a status quo. Based on this framework, the paper moves through an analysis of two AI modalities, supervised learning (e.g., machine vision) and unsupervised learning (e.g., game play), to demonstrate the potential of AI as a mechanism that creates patterns of association outside of a purely reproductive condition. This analysis is followed by an introduction to the concept of the technology of the surround, where the paper then turns toward theoretical positions that unbind categorical logics, moving toward other possible positionalities: the surround (Harney and Moten), alien intelligence (Parisi), and intra-actions of subject/object resolution (Barad). The paper frames two key concepts in relation to an AI in the wild: the colonial sublime and black techné. The paper concludes with a summation of what AI in the wild can contribute to the subversion of technologies of oppression toward a liberatory potential of AI.


2021 ◽  
Author(s):  
Yusuke Sakai ◽  
Yousuke Itoh ◽  
Piljong Jung ◽  
Keiko Kokeyama ◽  
Chihiro Kozakai ◽  
...  

In the data of laser interferometric gravitational wave detectors, transient noise with non-stationary and non-Gaussian features occurs at a high rate. It often causes problems such as detector instability and the hiding and/or imitation of gravitational-wave signals. This transient noise has various characteristics in the time-frequency representation, which are considered to be associated with environmental and instrumental origins. Classification of transient noise can offer one of the clues for exploring its origin and improving the performance of the detector. One approach to classifying these noises is supervised learning. However, supervised learning generally requires annotation of the training data, and there are issues with ensuring objectivity in the classification and its corresponding new classes. By contrast, unsupervised learning can reduce the annotation work for the training data and ensure objectivity in the classification and its corresponding new classes. In this study, we propose an architecture for the classification of transient noise using unsupervised learning, which combines a variational autoencoder and invariant information clustering. To evaluate the effectiveness of the proposed architecture, we used the dataset (time-frequency two-dimensional spectrogram images and labels) of the LIGO first observation run prepared by the Gravity Spy project. We find consistency between the labels annotated by the Gravity Spy project and the classes provided by our proposed unsupervised learning architecture, and the results suggest the potential existence of unrevealed classes.

