Pattern Synthesis in SVM Based Classifier

Author(s):  
C. Radha

An important problem in pattern recognition is that of pattern classification. The objective of classification is to determine a discriminant function that is consistent with the given training examples and performs reasonably well on an unlabeled test set of examples. The performance of the classifier on the test examples, known as its generalization performance, is an important issue in the design of the classifier. It has been established that good generalization performance can be achieved by providing the learner with a sufficiently large number of discriminative training examples. However, in many domains, it is infeasible or expensive to obtain a sufficiently large training set. Various mechanisms have been proposed in the literature to combat this problem. Active learning techniques (Angluin, 1998; Seung, Opper, & Sompolinsky, 1992) reduce the number of training examples required by carefully choosing discriminative training examples. Bootstrapping (Efron, 1979; Hamamoto, Uchimura & Tomita, 1997) and other pattern synthesis techniques generate a synthetic training set from the given training set. We present some of these techniques and propose some general mechanisms for pattern synthesis.
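To make the idea of generating a synthetic training set concrete, the following is a minimal sketch of one possible synthesis rule: each pattern of a class is replaced by the mean of itself and its k nearest neighbors in the same class. The local-mean rule, the function name `synthesize_local_means`, and the parameter `k` are assumptions chosen for illustration; the abstract does not specify which rule the chapter uses.

```python
import math


def synthesize_local_means(patterns, k=2):
    """Return one synthetic pattern per input pattern.

    `patterns` holds the patterns of a single class as tuples of floats.
    Each synthetic pattern is the mean of a pattern and its k nearest
    neighbors (Euclidean distance). This rule is an illustrative
    assumption, not the method of the cited works.
    """
    synthetic = []
    for i, x in enumerate(patterns):
        # All other patterns of the class, closest first.
        others = [p for j, p in enumerate(patterns) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        group = [x] + others[:k]
        # Component-wise mean of the pattern and its neighbors.
        synthetic.append(
            tuple(sum(p[d] for p in group) / len(group) for d in range(len(x)))
        )
    return synthetic


if __name__ == "__main__":
    class_a = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    print(synthesize_local_means(class_a, k=1))
```

The synthetic patterns would be generated separately for each class, so that class labels carry over unchanged.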

2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model depends on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generate numerous candidate train/test sets using the R-value-based sampling method. We then evaluate how similar the distribution of each candidate is to that of the whole dataset, and the candidate with the smallest distribution difference is selected as the final train/test set. Histograms and feature importance are used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
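The selection step can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are drawn by plain random shuffling rather than R-value-based sampling, only a single numeric feature is compared, and similarity is measured by the L1 distance between normalized histograms (feature importance is not used). The names `histogram`, `histogram_distance`, and `pick_split` are hypothetical.

```python
import random


def histogram(values, lo, hi, bins):
    """Normalized histogram of `values` over [lo, hi] with `bins` bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def histogram_distance(a, b):
    """L1 distance between two normalized histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pick_split(values, test_fraction=0.3, candidates=100, bins=5, seed=0):
    """Return (score, train_indices, test_indices) for the candidate split
    whose train and test histograms are closest to the whole dataset's."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    whole = histogram(values, lo, hi, bins)
    n = len(values)
    n_test = min(max(1, round(n * test_fraction)), n - 1)
    best = None
    for _ in range(candidates):
        idx = list(range(n))
        rng.shuffle(idx)  # assumption: random candidates, not R-value-based
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        train_hist = histogram([values[i] for i in train_idx], lo, hi, bins)
        test_hist = histogram([values[i] for i in test_idx], lo, hi, bins)
        score = (histogram_distance(train_hist, whole)
                 + histogram_distance(test_hist, whole))
        if best is None or score < best[0]:
            best = (score, sorted(train_idx), sorted(test_idx))
    return best


if __name__ == "__main__":
    data = [float(i) for i in range(20)]
    score, train, test = pick_split(data)
    print(score, train, test)
```

Fixing the seed makes the chosen split reproducible, which matters when the split itself is part of the reported evaluation protocol.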


Author(s):  
P. Viswanath ◽  
Narasimha M. Murty ◽  
Bhatnagar Shalabh

Parametric methods first choose the form of the model or hypotheses and then estimate the necessary parameters from the given dataset. The chosen form, which is based on experience or domain knowledge, need not match the form that actually exists (Duda, Hart & Stork, 2000). Further, apart from being highly error-prone, methods of this type adapt very poorly to dynamically changing datasets. On the other hand, non-parametric pattern recognition methods are attractive because they do not derive any model but work with the given dataset directly. These methods are highly adaptive to dynamically changing datasets. Two widely used non-parametric pattern recognition methods are (a) nearest neighbor based classification and (b) Parzen-Window based density estimation (Duda, Hart & Stork, 2000). Two major problems in applying non-parametric methods, especially with large and high-dimensional datasets, are (a) high computational requirements and (b) the curse of dimensionality (Duda, Hart & Stork, 2000). Algorithmic improvements and approximate methods can solve the first problem, whereas feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe, Washio, Motoda, Katai & Sawaragi, 2002) and bootstrapping techniques (Efron, 1979; Hamamoto, Uchimura & Tomita, 1997) can tackle the second problem. We propose a novel and unified solution for these problems by deriving a compact and generalized abstraction of the data. By this term, we mean a compact representation of the given patterns from which one can retrieve not only the original patterns but also some artificial patterns. The compactness of the abstraction reduces the computational requirements, and its generalization reduces the curse of dimensionality effect. Pattern synthesis techniques accompanied with compact representations attempt to derive compact and generalized abstractions of the data. These techniques are applied with (a) the nearest neighbor classifier (NNC), which is a popular non-parametric classifier used in many fields, including data mining, since its conception in the early fifties (Dasarathy, 2002), and (b) Parzen-Window based density estimation, which is a well-known non-parametric density estimation method (Duda, Hart & Stork, 2000).
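For readers unfamiliar with the two non-parametric methods named above, here is a minimal one-dimensional sketch of each. The Gaussian kernel, the window width `h`, and the function names are assumptions made for illustration; the abstract does not state which kernel or distance the authors use.

```python
import math


def parzen_density(x, samples, h):
    """Parzen-window estimate of the density at x from 1-D samples,
    using a Gaussian kernel of width h (kernel choice is an assumption)."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def nearest_neighbor_label(x, labelled):
    """Nearest neighbor classification: return the label of the stored
    (value, label) pair whose value is closest to x."""
    return min(labelled, key=lambda pair: abs(x - pair[0]))[1]


if __name__ == "__main__":
    print(parzen_density(0.0, [-1.0, 0.0, 1.0], h=0.5))
    print(nearest_neighbor_label(0.9, [(0.0, "a"), (1.0, "b")]))
```

Both functions work directly on the stored patterns and touch every one of them per query, which illustrates why computational cost grows with the size of the dataset.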


1994 ◽  
Vol 05 (01) ◽  
pp. 67-75 ◽  
Author(s):  
BYOUNG-TAK ZHANG

Much previous work on training multilayer neural networks has attempted to speed up the backpropagation algorithm using more sophisticated weight modification rules, whereby all the given training examples are used in a random or predetermined sequence. In this paper we investigate an alternative approach in which the learning proceeds on an increasing number of selected training examples, starting with a small training set. We derive a measure of criticality of examples and present an incremental learning algorithm that uses this measure to select a critical subset of given examples for solving the particular task. Our experimental results suggest that the method can significantly improve training speed and generalization performance in many real applications of neural networks. This method can be used in conjunction with other variations of gradient descent algorithms.


Author(s):  
ENRIQUE VIDAL ◽  
FRANCISCO CASACUBERTA

A new framework is introduced which allows the formulation of difficult structural classification tasks in terms of decision-theoretic-based pattern recognition. It is based on extending the classical formulation of generalized linear discriminant functions so as to permit each given object to have a different vector representation in each class. The proposed extension properly accounts for the corresponding extension of the classical learning techniques of linear discriminant functions in a way such that the convergence of the extended techniques can still be proved. The proposed framework can be considered as a hybrid methodology in which both structural and decision-theoretic pattern recognition are integrated. Furthermore, it can be considered as a means to achieve convenient tradeoffs between the inductive and deductive ways of knowledge acquisition, which can result in rendering tractable the possibly hard original inductive learning problem associated with the given task. The proposed framework and methods are illustrated through their use in two difficult structural classification tasks, showing both the appropriateness and the capability of these methods to obtain useful results.


Author(s):  
P. Viswanath ◽  
M. Narasimha Murty ◽  
Shalabh Bhatnagar

Two major problems in applying any pattern recognition technique to large and high-dimensional data are (a) high computational requirements and (b) the curse of dimensionality (Duda, Hart, & Stork, 2000). Algorithmic improvements and approximate methods can solve the first problem, whereas feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe, Washio, Motoda, Katai, & Sawaragi, 2002), and bootstrapping techniques (Efron, 1979; Hamamoto, Uchimura, & Tomita, 1997) can tackle the second problem. We propose a novel and unified solution for these problems by deriving a compact and generalized abstraction of the data. By this term, we mean a compact representation of the given patterns from which one can retrieve not only the original patterns but also some artificial patterns. The compactness of the abstraction reduces the computational requirements, and its generalization reduces the curse of dimensionality effect. Pattern synthesis techniques accompanied with compact representations attempt to derive compact and generalized abstractions of the data. These techniques are applied with the nearest neighbor classifier (NNC), which is a popular nonparametric classifier used in many fields, including data mining, since its conception in the early 1950s (Dasarathy, 2002).
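One way to picture a compact representation that yields both the original and some artificial patterns is sketched below. It assumes the features are partitioned into blocks, stores the distinct sub-patterns seen in each block, and recombines them across blocks. The partitioning scheme, the function name `synthesize_by_partition`, and the `blocks` argument are assumptions for illustration; the abstract itself does not describe the construction.

```python
from itertools import product


def synthesize_by_partition(patterns, blocks):
    """Recombine per-block sub-patterns of one class into full patterns.

    `patterns` are equal-length tuples from a single class; `blocks` is a
    partition of the feature indices, e.g. [(0, 1), (2, 3)]. The result
    contains every original pattern plus artificial recombinations.
    """
    # Compact representation: distinct sub-patterns stored per block.
    block_values = []
    for block in blocks:
        seen = []
        for p in patterns:
            sub = tuple(p[i] for i in block)
            if sub not in seen:
                seen.append(sub)
        block_values.append(seen)

    n_features = sum(len(b) for b in blocks)
    synthetic = set()
    # Every combination of one sub-pattern per block gives a full pattern.
    for combo in product(*block_values):
        pattern = [None] * n_features
        for block, sub in zip(blocks, combo):
            for i, v in zip(block, sub):
                pattern[i] = v
        synthetic.add(tuple(pattern))
    return synthetic


if __name__ == "__main__":
    originals = [(1, 2, 3, 4), (5, 6, 7, 8)]
    print(sorted(synthesize_by_partition(originals, [(0, 1), (2, 3)])))
```

In this sketch only the per-block sub-patterns need to be stored, while the set of retrievable patterns can be much larger than the original set.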


1982 ◽  
Vol 21 (01) ◽  
pp. 15-22 ◽  
Author(s):  
W. Schlegel ◽  
K. Kayser

A basic concept for the automatic diagnosis of histopathological specimens is presented. The algorithm is based on tissue structures of the original organ. Low power magnification was used to inspect the specimens. The form of the given tissue structures, e.g. diameter, distance, shape factor and number of neighbours, is measured. Graph theory is applied by using the centers of structures as vertices and the shortest connections between neighbours as edges. The algorithm leads to two independent sets of parameters which can be used for diagnostic procedures. First results with colon tissue show significant differences between normal tissue, benign and malignant growth. Polyps form glands that are twice as wide as those of normal and carcinomatous tissue. Carcinomas can be separated by the minimal distance of the glands formed. First results of pattern recognition using graph theory are discussed.


2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung River, data obtained through online monitoring were processed using a data mining method. In this method, the monitoring data are first arranged in a Microsoft Excel table and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The decision tree method was chosen because it is simpler, easy to understand, and highly accurate. A total of 5,476 Ciliwung River water quality monitoring records were processed. In the decision tree classification, 1,059 of these 5,476 records (19.3242%) indicated that the Ciliwung River was Not Polluted, and 4,417 records (80.6758%) indicated that it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed very high accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia Number 82 of 2001. The results also show that using the WEKA application with the Decision Tree algorithm to process monitoring data on three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.


2009 ◽  
Vol 7 (4) ◽  
pp. 846-856 ◽  
Author(s):  
Andrey Toropov ◽  
Alla Toropova ◽  
Emilio Benfenati

Usually, QSPR is not used to model organometallic compounds. We have modeled the octanol/water partition coefficient for organometallic compounds of Na, K, Ca, Cu, Fe, Zn, Ni, As, and Hg by optimal descriptors calculated with simplified molecular input line entry system (SMILES) notations. The best model is characterized by the following statistics: n=54, r²=0.9807, s=0.677, F=2636 (training set); n=26, r²=0.9693, s=0.969, F=759 (test set). Empirical criteria for the definition of the applicability domain for these models are discussed.

