Efficient data abstraction using weighted IB2 prototypes

2014 ◽  
Vol 11 (2) ◽  
pp. 665-678 ◽  
Author(s):  
Stefanos Ougiaroglou ◽  
Georgios Evangelidis

Data reduction techniques improve the efficiency of k-Nearest Neighbour classification on large datasets, since they accelerate the classification process and reduce the storage requirements for the training data. IB2 is an effective prototype selection data reduction technique. It selects some items from the initial training dataset and uses them as representatives (prototypes). In contrast to many other techniques, IB2 is a very fast, one-pass method that builds its reduced (condensing) set incrementally. New training data can update the condensing set without the need for the 'old', removed items. This paper proposes a variation of IB2 that generates new prototypes instead of selecting them. The variation, called AIB2, attempts to improve on IB2 by positioning the prototypes at the centre of the data areas they represent. The experimental study conducted in the present work, together with the Wilcoxon signed-ranks test, shows that AIB2 performs better than IB2.
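
A minimal sketch of the one-pass scheme the abstract describes: IB2 adds every misclassified item to the condensing set, while the AIB2 variant additionally drags the nearest same-class prototype toward each correctly classified item via a weighted mean. The update rule and names below are assumptions inferred from the abstract, not the paper's exact algorithm.

```python
import numpy as np

def aib2(X, y):
    """Sketch of AIB2 as summarised above: a one-pass IB2 variant whose
    prototypes drift toward the centre of the region they represent.
    The weighted-mean update is an assumption based on the abstract."""
    prototypes = [X[0].copy()]        # first item seeds the condensing set
    labels = [y[0]]
    weights = [1]                      # items absorbed by each prototype
    for x, c in zip(X[1:], y[1:]):
        # 1-NN classification against the current condensing set
        d = [np.linalg.norm(x - p) for p in prototypes]
        nn = int(np.argmin(d))
        if labels[nn] == c:
            # correct: move the nearest prototype toward x (weighted mean)
            w = weights[nn]
            prototypes[nn] = (prototypes[nn] * w + x) / (w + 1)
            weights[nn] = w + 1
        else:
            # misclassified: the item itself becomes a new prototype (as in IB2)
            prototypes.append(x.copy())
            labels.append(c)
            weights.append(1)
    return np.array(prototypes), np.array(labels)
```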

Author(s):  
Josephine M. Namayanja

Computational techniques, such as Simple K-Means, have been used for exploratory analysis in applications ranging from data mining research and machine learning to computational biology. The medical domain has benefitted from these applications, and in this regard the authors analyze patterns in individuals of selected age groups linked with the possibility of Metabolic Syndrome (MetS), a disorder affecting approximately 45% of the elderly. The study identifies groups of individuals falling into two defined categories, namely those diagnosed with MetS (MetS Positive) and those who are not (MetS Negative), and compares the resulting pattern definitions. The paper compares the cluster formations obtained when a data reduction technique known as Singular Value Decomposition (SVD) is applied before clustering versus when it is omitted. Data reduction techniques like SVD have proved very useful for projecting out only what are considered the key relations in the data while suppressing the less important ones. In the presence of high dimensionality, SVD can be highly effective. By applying two internal measures to validate cluster quality, the study reaches findings that are interesting with respect to both approaches.
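
A minimal sketch of the comparison described above: cluster a feature matrix with and without an SVD projection and score both clusterings with internal validity measures. The synthetic data, the number of components, and the choice of silhouette and Davies-Bouldin as the two internal measures are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))      # stand-in for the MetS feature matrix

def cluster_quality(X, k=2):
    # two clusters mirror the MetS Positive / MetS Negative split
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)

# clustering with and without the SVD projection
X_svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)
print("raw :", cluster_quality(X))
print("SVD :", cluster_quality(X_svd))
```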


2019 ◽  
Vol 8 (3) ◽  
pp. 4373-4378

Huge amounts of data belonging to different domains are being stored rapidly in various repositories across the globe. Extracting useful information from these volumes of data is always difficult due to the dynamic nature of the stored data. Data Mining is a knowledge discovery process used to extract the hidden information, in the form of patterns, from the data stored in various repositories termed warehouses. One of the popular tasks of data mining is Classification, which deals with the process of assigning every instance of a data set to one of the predefined class labels. The banking system is a real-world domain that collects huge amounts of client data on a daily basis. In this work, we have collected two variants of the bank marketing data set pertaining to a Portuguese financial institution, consisting of 41188 and 45211 instances, and performed classification on them using two data reduction techniques. Attribute subset selection has been performed on the first data set, and the training data with the selected features are used in classification. Principal Component Analysis has been performed on the second data set, and the training data with the extracted features are used in classification. A deep neural network classification algorithm based on backpropagation has been developed to perform classification on both data sets. Finally, comparisons are made between the performance of each deep neural network classifier and four standard classifiers, namely decision trees, Naïve Bayes, support vector machines, and k-nearest neighbours. It has been found that the deep neural network classifier outperforms the existing classifiers in terms of accuracy.
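
A minimal sketch of the first pipeline: the 41188- and 45211-instance variants correspond to the UCI bank-marketing files bank-additional-full.csv and bank-full.csv; the local paths, the number of selected attributes, the mutual-information scoring function, and the network architecture below are assumptions, with scikit-learn's backpropagation-trained MLPClassifier standing in for the paper's deep network.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# first variant (41188 rows); the second variant would use bank-full.csv
df = pd.read_csv("bank-additional-full.csv", sep=";")
X = pd.get_dummies(df.drop(columns="y"))   # one-hot encode categoricals
y = (df["y"] == "yes").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# route 1: attribute subset selection (scoring function is an assumption);
# route 2 would instead apply PCA, e.g. PCA(n_components=15), on scaled data
sel = SelectKBest(mutual_info_classif, k=15).fit(X_tr, y_tr)
X_tr_r, X_te_r = sel.transform(X_tr), sel.transform(X_te)

scaler = StandardScaler().fit(X_tr_r)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(scaler.transform(X_tr_r), y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(scaler.transform(X_te_r))))
```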


Author(s):  
Pierre Meunier

Multivariate data reduction techniques, such as principal components analysis (PCA), offer the potential to simplify the task of designing and evaluating workspaces for anthropometric accommodation of the user population. Simplification occurs by reducing the number of variables one has to consider while retaining most, e.g. 89%, of the original dataset's variability. The error introduced by choosing to ignore some (11%) of the variability is examined in this paper. A set of eight design mannequins was generated using a data reduction method developed for MIL-STD-1776A. These mannequins, located on the periphery of circles encompassing 90%, 95% and 99% of the population on two principal components, were compared with the true multivariate 90%, 95% and 99% of the population. The PCA mannequins were found to include less of the population than originally intended. How closely the mannequins matched the true percentage of the population was found to depend mainly on the size of the initial envelope (larger envelopes were closer to the true accommodation limits). The paper also discusses some of the limitations of using limited numbers of test cases to predict population accommodation.
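
A minimal sketch of the kind of construction described above, not the MIL-STD-1776A procedure itself: fit a two-component PCA, place eight cases on the circle enclosing a chosen fraction of a bivariate normal population in standardised PC space, and map them back to measurement space. The synthetic data and all construction details are assumptions; the final check illustrates the paper's point that the 2-PC circle can cover less of the full multivariate population than intended.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 6))          # stand-in anthropometric measurements

pca = PCA(n_components=2).fit(A)
scores = pca.transform(A) / np.sqrt(pca.explained_variance_)  # standardise PCs

def boundary_mannequins(p, n=8):
    """Place n cases on the circle enclosing fraction p of a bivariate
    normal population in standardised PC space, then map back to
    measurement space (an assumed reading of the paper's construction)."""
    r = np.sqrt(chi2.ppf(p, df=2))      # e.g. p=0.95 -> r ~ 2.45
    ang = np.linspace(0, 2 * np.pi, n, endpoint=False)
    z = r * np.column_stack([np.cos(ang), np.sin(ang)])
    return pca.inverse_transform(z * np.sqrt(pca.explained_variance_))

# how much of the sample actually falls inside the nominal 95% circle?
inside = (scores**2).sum(axis=1) <= chi2.ppf(0.95, df=2)
print(boundary_mannequins(0.95).shape, inside.mean())
```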


Author(s):  
N. Li ◽  
N. Pfeifer

Training dataset generation is a difficult and expensive task for LiDAR point classification, especially in the case of large-area classification. We present a method to automatically extend a small set of training data by label propagation. The class labels can be correctly extended to their optimal neighbourhood, and the most informative points are selected and added to the training set. With the final extended training dataset, the overall accuracy (OA) of the classification could be increased by about 2%. We also show that this approach is stable regardless of the number of initial training points, and achieves better improvements especially when starting with an extremely small initial training set.
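
A minimal sketch of the training-set extension step, using scikit-learn's LabelSpreading over a k-nearest-neighbour graph as a stand-in for the paper's propagation method. The point features, seed-set size, neighbourhood size, and confidence threshold are all assumptions.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))           # stand-in for LiDAR point features
y = -1 * np.ones(2000, dtype=int)        # -1 marks unlabelled points
seed = rng.choice(2000, size=40, replace=False)
y[seed] = (X[seed, 0] > 0).astype(int)   # tiny initial training set

# propagate seed labels through the point neighbourhood graph
model = LabelSpreading(kernel="knn", n_neighbors=10).fit(X, y)

# keep only confidently propagated points and add them to the training set
conf = model.label_distributions_.max(axis=1)
new_train = np.where((y == -1) & (conf > 0.9))[0]
y[new_train] = model.transduction_[new_train]
print(f"training set grew from {len(seed)} to {(y != -1).sum()} points")
```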


2021 ◽  
Author(s):  
Assima Rakhimbekova ◽  
Anton Lopukhov ◽  
Natalia L. Klyachko ◽  
Alexander Kabanov ◽  
Timur I. Madzhidov ◽  
...  

Active learning (AL) has become a subject of active recent research in both industry and academia as an efficient approach for the rapid design and discovery of novel chemicals, materials, and polymers. The key advantages of this approach relate to its ability to (i) employ relatively small datasets for model development, (ii) iterate between model development and model assessment using small external datasets that can either be generated in focused experimental studies or formed from subsets of the initial training data, and (iii) progressively evolve models toward increasingly reliable predictions and the identification of novel chemicals with the desired properties. Herein, we first compared various AL protocols for their effectiveness in finding biologically active molecules using synthetic datasets. We investigated the dependency of AL performance on the size of the initial training set, the relative complexity of the task, and the choice of the initial training dataset. We found that AL techniques applied to regression modeling offer no benefit over random search, while AL used for classification tasks performs better than models built on randomly selected training sets, though still far from perfect. Using the best-performing AL protocol, we assessed the applicability of AL for the discovery of polymeric micelle formulations for poorly soluble drugs. Finally, the best-performing AL approach was employed to discover and experimentally validate novel binding polymers in a case study of the asialoglycoprotein receptor (ASGPR).
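
A minimal sketch of one member of the family of AL protocols compared above: an uncertainty-sampling loop for a classification task, starting from a small initial training set and growing it in batches. The model choice, batch size, iteration count, and synthetic "activity" labels are assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))              # stand-in descriptor matrix
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # synthetic activity labels

labeled = list(rng.choice(5000, size=50, replace=False))  # initial training set
pool = [i for i in range(5000) if i not in labeled]

for _ in range(10):                          # 10 AL iterations
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[labeled], y[labeled])
    # query the pool compounds the model is least certain about
    proba = clf.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)
    picks = np.argsort(uncertainty)[-20:]    # 20 most uncertain
    chosen = [pool[i] for i in picks]
    labeled += chosen                        # "measure" and add them
    pool = [i for i in pool if i not in chosen]

print("final training size:", len(labeled))
```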


10.5772/55996 ◽  
2013 ◽  
Vol 10 (5) ◽  
pp. 240
Author(s):  
Khursheed Khursheed ◽  
Muhammad Imran ◽  
Naeem Ahmad ◽  
Mattias O'Nils
