Pattern Synthesis in SVM Based Classifier

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch233 ◽

2011 ◽

pp. 1517-1523

Author(s):

C. Radha

Keyword(s):

Pattern Recognition ◽

Pattern Classification ◽

Discriminative Training ◽

Training Set ◽

Pattern Synthesis ◽

Test Set ◽

Generalization Performance ◽

Learning Techniques ◽

Training Examples ◽

The Given

An important problem in pattern recognition is that of pattern classification. The objective of classification is to determine a discriminant function which is consistent with the given training examples and performs reasonably well on an unlabeled test set of examples. The degree of performance of the classifier on the test examples, known as its generalization performance, is an important issue in the design of the classifier. It has been established that a good generalization performance can be achieved by providing the learner with a sufficiently large number of discriminative training examples. However, in many domains, it is infeasible or expensive to obtain a sufficiently large training set. Various mechanisms have been proposed in the literature to combat this problem. Active Learning techniques (Angluin, 1998; Seung, Opper, & Sompolinsky, 1992) reduce the number of training examples required by carefully choosing discriminative training examples. Bootstrapping (Efron, 1979; Hamamoto, Uchimura & Tomita, 1997) and other pattern synthesis techniques generate a synthetic training set from the given training set. We present some of these techniques and propose some general mechanisms for pattern synthesis.
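As a rough illustration of what bootstrap-style pattern synthesis can look like, the Python sketch below builds one synthetic pattern per original by averaging it with its nearest same-class neighbours. The function names, the parameter `r`, and the toy data are assumptions made for this example; it is one simple variant and should not be read as the chapter's own method.

```python
import math

def synthesize_class(patterns, r=3):
    """Return one synthetic pattern per original pattern: the mean of the
    pattern and its r - 1 nearest same-class neighbours (Euclidean distance)."""
    synthetic = []
    for p in patterns:
        # Sorting by distance to p keeps p itself in the group (distance 0).
        nearest = sorted(patterns, key=lambda q: math.dist(p, q))[:r]
        synthetic.append(
            tuple(sum(q[i] for q in nearest) / len(nearest) for i in range(len(p)))
        )
    return synthetic

def synthesize(training_set, r=3):
    """training_set: dict mapping a class label to a list of feature tuples."""
    return {label: synthesize_class(pats, r) for label, pats in training_set.items()}

# Toy data (hypothetical): two classes in a two-dimensional feature space.
train = {
    "a": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "b": [(5.0, 5.0), (6.0, 5.0)],
}
synthetic = synthesize(train, r=2)
```

Synthesis is done per class so that a synthetic pattern never mixes labels; the synthetic set can then be used alongside, or instead of, the original training set.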

Feature-Weighted Sampling for Proper Evaluation of Classification Models

Applied Sciences ◽

10.3390/app11052039 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2039

Author(s):

Hyunseok Shin ◽

Sejong Oh

Keyword(s):

Random Sampling ◽

Sampling Method ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Feature Importance ◽

Proper Training ◽

Machine Learning Applications ◽

Test Sets ◽

The Given

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model differs depending on how the dataset is divided into training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more proper training and test sets than previous sampling methods, including random and non-random sampling.
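The selection step described in this abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions: candidate splits are drawn by plain random shuffling rather than by the paper's R-value-based sampling, the difference measure uses unweighted per-feature histograms only (no feature-importance weighting), and all names and defaults are hypothetical.

```python
import random

def histogram(values, lo, hi, bins=5):
    """Relative-frequency histogram of values over the range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def distribution_difference(subset, whole, bins=5):
    """Sum over features of the L1 distance between subset and whole histograms."""
    total = 0.0
    for j in range(len(whole[0])):
        col = [row[j] for row in whole]
        lo, hi = min(col), max(col)
        h_whole = histogram(col, lo, hi, bins)
        h_sub = histogram([row[j] for row in subset], lo, hi, bins)
        total += sum(abs(a - b) for a, b in zip(h_whole, h_sub))
    return total

def best_split(data, test_fraction=0.3, candidates=50, seed=0):
    """Return (score, train, test) for the candidate closest to the whole dataset."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    best = None
    for _ in range(candidates):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        score = distribution_difference(train, data) + distribution_difference(test, data)
        if best is None or score < best[0]:
            best = (score, train, test)
    return best

# Toy dataset (hypothetical): 40 rows with two numeric features.
data = [(float(i), float(i % 7)) for i in range(40)]
score, train, test = best_split(data)
```

A lower score means that both subsets have per-feature histograms closer to those of the full dataset, which is the property the abstract uses to pick the final train/test set.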

Pattern Synthesis for Nonparametric Pattern Recognition

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch232 ◽

2011 ◽

pp. 1511-1516

Author(s):

P. Viswanath ◽

M. Narasimha Murty ◽

Shalabh Bhatnagar

Keyword(s):

Pattern Recognition ◽

Density Estimation ◽

Nearest Neighbor ◽

Curse Of Dimensionality ◽

Parametric Methods ◽

Pattern Synthesis ◽

Parzen Window ◽

Pattern Recognition Methods ◽

The Given ◽

Non Parametric

Parametric methods first choose the form of the model or hypothesis and then estimate the necessary parameters from the given dataset. The chosen form, which is based on experience or domain knowledge, need not match the form that actually exists (Duda, Hart & Stork, 2000). Further, apart from being highly error-prone, methods of this type adapt very poorly to dynamically changing datasets. On the other hand, non-parametric pattern recognition methods are attractive because they do not derive any model but work with the given dataset directly. These methods are highly adaptive to dynamically changing datasets. Two widely used non-parametric pattern recognition methods are (a) nearest neighbor based classification and (b) Parzen-Window based density estimation (Duda, Hart & Stork, 2000). Two major problems in applying non-parametric methods, especially with large and high-dimensional datasets, are (a) the high computational requirements and (b) the curse of dimensionality (Duda, Hart & Stork, 2000). Algorithmic improvements and approximate methods can solve the first problem, whereas feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe, Washio, Motoda, Katai & Sawaragi, 2002) and bootstrapping techniques (Efron, 1979; Hamamoto, Uchimura & Tomita, 1997) can tackle the second problem. We propose a novel and unified solution for these problems by deriving a compact and generalized abstraction of the data. By this term, we mean a compact representation of the given patterns from which one can retrieve not only the original patterns but also some artificial patterns. The compactness of the abstraction reduces the computational requirements, and its generalization reduces the curse of dimensionality effect. Pattern synthesis techniques accompanied with compact representations attempt to derive compact and generalized abstractions of the data. These techniques are applied with (a) the nearest neighbor classifier (NNC), which is a popular non-parametric classifier used in many fields, including data mining, since its conception in the early 1950s (Dasarathy, 2002), and (b) Parzen-Window based density estimation, which is a well-known non-parametric density estimation method (Duda, Hart & Stork, 2000).
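For readers unfamiliar with the two non-parametric methods named in this abstract, a minimal Python sketch of each follows. It assumes Euclidean distance and, for the Parzen-window estimate, a Gaussian kernel with a hypothetical width parameter `h`; neither choice is stated in the abstract, and the pattern synthesis and compact representation proposed by the authors are not shown.

```python
import math

def nearest_neighbor_label(x, training):
    """training: list of (feature_tuple, label) pairs.
    Returns the label of the training pattern closest to x."""
    return min(training, key=lambda item: math.dist(x, item[0]))[1]

def parzen_density(x, samples, h=1.0):
    """Parzen-window density estimate at x using a Gaussian kernel of width h:
    the average of Gaussian bumps centred on each sample."""
    d = len(x)
    norm = (2 * math.pi) ** (d / 2) * h ** d
    total = sum(math.exp(-math.dist(x, s) ** 2 / (2 * h * h)) for s in samples)
    return total / (len(samples) * norm)

# Toy usage (hypothetical data).
training = [((0.0, 0.0), "a"), ((1.0, 1.0), "b")]
label = nearest_neighbor_label((0.1, 0.1), training)
density = parzen_density((0.0,), [(0.0,)], h=1.0)
```

Both functions scan every stored pattern for each query, which is the computational cost the abstract refers to when datasets become large.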

ACCELERATED LEARNING BY ACTIVE EXAMPLE SELECTION

International Journal of Neural Systems ◽

10.1142/s0129065794000086 ◽

1994 ◽

Vol 05 (01) ◽

pp. 67-75 ◽

Cited By ~ 32

Author(s):

BYOUNG-TAK ZHANG

Keyword(s):

Neural Networks ◽

Gradient Descent ◽

Learning Algorithm ◽

Accelerated Learning ◽

Training Set ◽

Alternative Approach ◽

Speed Up ◽

Multilayer Neural Networks ◽

Training Examples ◽

The Given

Much previous work on training multilayer neural networks has attempted to speed up the backpropagation algorithm using more sophisticated weight modification rules, whereby all the given training examples are used in a random or predetermined sequence. In this paper we investigate an alternative approach in which the learning proceeds on an increasing number of selected training examples, starting with a small training set. We derive a measure of criticality of examples and present an incremental learning algorithm that uses this measure to select a critical subset of given examples for solving the particular task. Our experimental results suggest that the method can significantly improve training speed and generalization performance in many real applications of neural networks. This method can be used in conjunction with other variations of gradient descent algorithms.
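The incremental scheme outlined in this abstract can be illustrated with a small Python sketch. It is heavily simplified and rests on assumptions: a one-variable linear model stands in for a multilayer network, and the "criticality" of an example is taken to be the current model's absolute error on it, which is a placeholder rather than the measure derived in the paper. All names and defaults are hypothetical.

```python
def train(model, examples, lr=0.05, epochs=500):
    """Plain gradient descent on squared error for the model y = w * x + b."""
    w, b = model
    for _ in range(epochs):
        for x, y in examples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def criticality(model, example):
    """Placeholder criticality: absolute error of the current model on the example."""
    w, b = model
    x, y = example
    return abs((w * x + b) - y)

def incremental_learning(pool, initial=2, step=2, tolerance=0.01):
    """Start from a small training set and repeatedly add the most critical examples."""
    selected, remaining = pool[:initial], pool[initial:]
    model = train((0.0, 0.0), selected)
    while remaining:
        remaining.sort(key=lambda ex: criticality(model, ex), reverse=True)
        if criticality(model, remaining[0]) < tolerance:
            break  # every unused example is already fit well enough
        selected, remaining = selected + remaining[:step], remaining[step:]
        model = train(model, selected)
    return model, selected

# Toy pool (hypothetical): noiseless samples of y = 2x + 1 on [0, 1].
pool = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
model, selected = incremental_learning(pool)
```

The loop stops as soon as no unused example is critical, so training may finish having used only part of the pool.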

A HYBRID FRAMEWORK COMBINING STRUCTURAL AND DECISION-THEORETIC PATTERN RECOGNITION AND APPLICATIONS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001489000152 ◽

1989 ◽

Vol 03 (02) ◽

pp. 181-206 ◽

Cited By ~ 3

Author(s):

ENRIQUE VIDAL ◽

FRANCISCO CASACUBERTA

Keyword(s):

Pattern Recognition ◽

Inductive Learning ◽

Extended Techniques ◽

Discriminant Functions ◽

Structural Classification ◽

Linear Discriminant ◽

Classical Formulation ◽

Learning Techniques ◽

Classification Tasks ◽

The Given

A new framework is introduced which allows the formulation of difficult structural classification tasks in terms of decision-theoretic-based pattern recognition. It is based on extending the classical formulation of generalized linear discriminant functions so as to permit each given object to have a different vector representation in each class. The proposed extension properly accounts for the corresponding extension of the classical learning techniques of linear discriminant functions in a way such that the convergence of the extended techniques can still be proved. The proposed framework can be considered as a hybrid methodology in which both structural and decision-theoretic pattern recognition are integrated. Furthermore, it can be considered as a means to achieve convenient tradeoffs between the inductive and deductive ways of knowledge acquisition, which can result in rendering tractable the possibly hard original inductive learning problem associated with the given task. The proposed framework and methods are illustrated through their use in two difficult structural classification tasks, showing both the appropriateness and the capability of these methods to obtain useful results.

Pattern Synthesis for Large-Scale Pattern Recognition

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch170 ◽

2011 ◽

pp. 902-905

Author(s):

P. Viswanath ◽

M. Narasimha Murty ◽

Shalabh Bhatnagar

Keyword(s):

Pattern Recognition ◽

Large Scale ◽

Nearest Neighbor ◽

Curse Of Dimensionality ◽

Compact Representation ◽

Pattern Synthesis ◽

Approximate Methods ◽

Compact Representations ◽

The Given ◽

Neighbor Classifier

Two major problems in applying any pattern recognition technique for large and high-dimensional data are (a) high computational requirements and (b) curse of dimensionality (Duda, Hart, & Stork, 2000). Algorithmic improvements and approximate methods can solve the first problem, whereas feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe, Washio, Motoda, Katai, & Sawaragi, 2002), and bootstrapping techniques (Efron, 1979; Hamamoto, Uchimura, & Tomita, 1997) can tackle the second problem. We propose a novel and unified solution for these problems by deriving a compact and generalized abstraction of the data. By this term, we mean a compact representation of the given patterns from which one can retrieve not only the original patterns but also some artificial patterns. The compactness of the abstraction reduces the computational requirements, and its generalization reduces the curse of dimensionality effect. Pattern synthesis techniques accompanied with compact representations attempt to derive compact and generalized abstractions of the data. These techniques are applied with nearest neighbor classifier (NNC), which is a popular nonparametric classifier used in many fields, including data mining, since its conception in the early 1950s (Dasarathy, 2002).

Pattern Recognition in Histo-Pathology: Basic Considerations

Methods of Information in Medicine ◽

10.1055/s-0038-1635387 ◽

1982 ◽

Vol 21 (01) ◽

pp. 15-22 ◽

Cited By ~ 10

Author(s):

W. Schlegel ◽

K. Kayser

Keyword(s):

Pattern Recognition ◽

Graph Theory ◽

Normal Tissue ◽

Diagnostic Procedures ◽

Minimal Distance ◽

Malignant Growth ◽

Colon Tissue ◽

First Results ◽

Tissue Structures ◽

The Given

A basic concept for the automatic diagnosis of histo-pathological specimens is presented. The algorithm is based on tissue structures of the original organ. Low power magnification was used to inspect the specimens. The form of the given tissue structures, e.g. diameter, distance, shape factor and number of neighbours, is measured. Graph theory is applied by using the center of structures as vertices and the shortest connection of neighbours as edges. The algorithm leads to two independent sets of parameters which can be used for diagnostic procedures. First results with colon tissue show significant differences between normal tissue, benign and malignant growth. Polyps form glands that are twice as wide as normal and carcinomatous tissue. Carcinomas can be separated by the minimal distance of the glands formed. First results of pattern recognition using graph theory are discussed.

PREDICTION OF CILIWUNG RIVER WATER QUALITY USING THE DECISION TREE ALGORITHM

Jurnal Air Indonesia ◽

10.29122/jai.v12i2.4364 ◽

2021 ◽

Vol 12 (2) ◽

Author(s):

Mohammad Haekal ◽

Henki Bayu Seta ◽

Mayanda Mega Santoni

Keyword(s):

Data Mining ◽

Decision Tree ◽

Cross Validation ◽

Online Monitoring ◽

Training Set ◽

Microsoft Excel ◽

Test Set

To predict the water quality of the Ciliwung River, data collected through Online Monitoring were processed using a Data Mining method. In this method, the monitoring data were first compiled into a Microsoft Excel table and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The Decision Tree method was chosen because it is simpler, easy to understand, and has a very high level of accuracy. A total of 5,476 Ciliwung River water quality monitoring records were processed. According to the Decision Tree classification, of these 5,476 records, 1,059 (19.3242%) indicate that the Ciliwung River is Not Polluted and 4,417 (80.6758%) indicate that it is Polluted. The monitoring data were then evaluated using four Test Options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed a very high level of accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia Number 82 of 2001. It was also found that using the WEKA application with the Decision Tree algorithm to process the monitoring data on three parameters (pH, DO, and nitrate) is very accurate and appropriate. Keywords: river water quality, Data Mining, Decision Tree algorithm, WEKA application.
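Two of the test options named in this abstract, the 66% percentage split and 10-fold cross-validation, amount to different ways of partitioning the record indices. The Python sketch below shows only that partitioning, using the abstract's record count of 5,476. It is an illustration with hypothetical function names and does not reproduce WEKA's behaviour, which may also shuffle or stratify the data.

```python
def percentage_split(n, train_percent=66):
    """Return (train_indices, test_indices), using the first train_percent% for training."""
    cut = round(n * train_percent / 100)
    return list(range(cut)), list(range(cut, n))

def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds.
    Fold sizes differ by at most one when n is not divisible by k."""
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n))
        yield train, test
        start += size

n_records = 5476  # number of monitoring records reported in the abstract
split_train, split_test = percentage_split(n_records)
folds = list(k_fold_indices(n_records))
```

In cross-validation every record is used for testing exactly once, whereas the percentage split tests on a single held-out portion.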

QSPR modelling of the octanol/water partition coefficient of organometallic substances by optimal SMILES-based descriptors

Open Chemistry ◽

10.2478/s11532-009-0095-y ◽

2009 ◽

Vol 7 (4) ◽

pp. 846-856 ◽

Cited By ~ 6

Author(s):

Andrey Toropov ◽

Alla Toropova ◽

Emilio Benfenati

Keyword(s):

Partition Coefficient ◽

Organometallic Compounds ◽

Applicability Domain ◽

Training Set ◽

Input Line ◽

Test Set ◽

Water Partition Coefficient ◽

Definition Of

Usually, QSPR is not used to model organometallic compounds. We have modeled the octanol/water partition coefficient for organometallic compounds of Na, K, Ca, Cu, Fe, Zn, Ni, As, and Hg by optimal descriptors calculated with simplified molecular input line entry system (SMILES) notations. The best model is characterized by the following statistics: n=54, r²=0.9807, s=0.677, F=2636 (training set); n=26, r²=0.9693, s=0.969, F=759 (test set). Empirical criteria for the definition of the applicability domain for these models are discussed.
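The reported statistics can be related to one another with a short calculation. The Python sketch below assumes a least-squares model with an intercept and a single descriptor, which the abstract does not state explicitly, and uses the standard relation between the F ratio, the coefficient of determination, and the sample size. The function name is hypothetical.

```python
def f_statistic(r2, n, predictors=1):
    """F ratio of a least-squares fit with intercept:
    F = (r2 / p) / ((1 - r2) / (n - p - 1)), with p predictors and n samples."""
    return (r2 / predictors) / ((1.0 - r2) / (n - predictors - 1))

# Values of n and r2 as reported in the abstract.
f_train = f_statistic(0.9807, 54)  # abstract reports F = 2636 for the training set
f_test = f_statistic(0.9693, 26)   # abstract reports F = 759 for the test set
```

Under the single-descriptor assumption, both computed values fall within one percent of the reported F values; the small gap is plausibly due to rounding of the published r² figures.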
