Two-stage SVM classification for large data sets via randomly reducing and recovering training data

Author(s):  
Xiaoou Li ◽  
Jair Cervantes ◽  
Wen Yu
2020 ◽  
Vol 36 (4) ◽  
pp. 803-825
Author(s):  
Marco Fortini

Abstract Record linkage addresses the problem of identifying pairs of records that come from different sources and refer to the same unit of interest. Fellegi and Sunter propose an optimal statistical test to assign the match status to the candidate pairs, in which the needed parameters are obtained through the EM algorithm applied directly to the set of candidate pairs, without recourse to training data. However, this procedure has quadratic complexity as the two lists to be matched grow. In addition, a large bias in the EM-estimated parameters is produced in this case, so the problem is usually tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, the probability that excluded pairs are actually true matches cannot be assessed through such methods. The present work proposes an efficient approach in which the comparisons of records between lists are minimised while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased estimates of the parameters. The improvement achieved by the suggested method is shown by means of simulations and an application based on real data.


2015 ◽  
Vol 37 ◽  
pp. 787-798 ◽  
Author(s):  
Jair Cervantes ◽  
Farid García Lamont ◽  
Asdrúbal López-Chau ◽  
Lisbeth Rodríguez Mazahua ◽  
J. Sergio Ruíz

Author(s):  
Ignasi Echaniz Soldevila ◽  
Victor L. Knoop ◽  
Serge Hoogendoorn

Traffic engineers rely on microscopic traffic models to design, plan, and operate a wide range of traffic applications. Recently, large data sets, yet incomplete and from small space regions, are becoming available thanks to technology improvements and governmental efforts. With this study we aim to gain new empirical insights into longitudinal driving behavior and to formulate a model which can benefit from these new challenging data sources. This paper proposes an application of an existing formulation, Gaussian process regression (GPR), to describe individual longitudinal driving behavior of drivers. The method integrates a parametric and a non-parametric mathematical formulation. The model predicts individual driver’s acceleration given a set of variables. It uses the GPR to make predictions when there exists correlation between new input and the training data set. The data-driven model benefits from a large training data set to capture all driver longitudinal behavior, which would be difficult to fit in fixed parametric equation(s). The methodology allows us to train models with new variables without the need of altering the model formulation. And importantly, the model also uses existing traditional parametric car-following models to predict acceleration when no similar situations are found in the training data set. A case study using radar data in an urban environment shows that a hybrid model performs better than parametric model alone and suggests that traffic light status over time influences drivers’ acceleration. This methodology can help engineers to use large data sets and to find new variables to describe traffic behavior.


2018 ◽  
Vol 25 (3) ◽  
pp. 655-670 ◽  
Author(s):  
Tsung-Wei Ke ◽  
Aaron S. Brewster ◽  
Stella X. Yu ◽  
Daniela Ushizima ◽  
Chao Yang ◽  
...  

A new tool is introduced for screening macromolecular X-ray crystallography diffraction images produced at an X-ray free-electron laser light source. Based on a data-driven deep learning approach, the proposed tool executes a convolutional neural network to detect Bragg spots. Automatic image processing algorithms described can enable the classification of large data sets, acquired under realistic conditions consisting of noisy data with experimental artifacts. Outcomes are compared for different data regimes, including samples from multiple instruments and differing amounts of training data for neural network optimization.


Author(s):  
Xingjie Fang ◽  
Liping Wang ◽  
Don Beeson ◽  
Gene Wiggs

Radial Basis Function (RBF) metamodels have recently attracted increased interest due to their significant advantages over other types of non-parametric metamodels. However, because of the interpolation nature of the RBF mathematics, the accuracy of the model may dramatically deteriorate if the training data set used contains duplicate information, noise or outliers. Also constructing the metamodel may be time consuming whenever the training data sets are large or a high dimensional model is required. In this paper, we propose a robust and efficient RBF metamodeling approach based on data pre-processing techniques that alleviate the accuracy and efficiency issues commonly encountered when RBF models are used in typical real engineering situations. These techniques include 1) the removal of duplicate training data information, 2) the generation of smaller uniformly distributed subsets of training data from large data sets and 3) the quantification and identification of outliers by principal component analysis (PCA) and Hotelling statistics. Simulation results are used to validate the generalization accuracy and efficiency of the proposed approach.
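
As a rough illustration of two of the pre-processing steps named in the abstract above (duplicate removal, and outlier identification by PCA with a Hotelling statistic), the following is a minimal sketch. It is not the authors' implementation. The function names, the use of NumPy, and the empirical-quantile cutoff are all assumptions made for this example; the abstract does not say how the threshold is chosen.

```python
import numpy as np

def drop_duplicates(X, y):
    """Keep the first occurrence of each distinct input row (hypothetical helper)."""
    _, first_idx = np.unique(X, axis=0, return_index=True)
    keep = np.sort(first_idx)  # preserve the original sample order
    return X[keep], y[keep]

def hotelling_t2(X, n_components):
    """Hotelling T^2 score of each row, computed on the leading principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    variances = s[:n_components] ** 2 / (X.shape[0] - 1)
    return np.sum(scores ** 2 / variances, axis=1)

def flag_outliers(X, n_components, quantile=0.99):
    """Flag rows whose T^2 exceeds an empirical quantile (assumed cutoff rule)."""
    t2 = hotelling_t2(X, n_components)
    return t2 > np.quantile(t2, quantile)
```

Flagged rows would then be inspected or removed before the RBF model is fitted. A distribution-based limit for the Hotelling statistic could replace the empirical quantile used here.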


2018 ◽  
Vol 2 (3) ◽  
pp. 324-335 ◽  
Author(s):  
Johannes Kvam ◽  
Lars Erik Gangsei ◽  
Jørgen Kongsro ◽  
Anne H Schistad Solberg

Abstract Computed tomography (CT) scanning of pigs has been shown to produce detailed phenotypes useful in pig breeding. Due to the large number of individuals scanned and corresponding large data sets, there is a need for automatic tools for analysis of these data sets. In this paper, the feasibility of deep learning for fully automatic segmentation of the skeleton of pigs from CT volumes is explored. To maximize performance, given the training data available, a series of problem simplifications are applied. The deep-learning approach can replace our currently used semiautomatic solution, with increased robustness and little or no need for manual control. Accuracy was highly affected by training data, and expanding the training set can further increase performance making this approach especially promising.


2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Ion Matei ◽  
Maksym Zhenirovskyy ◽  
Johan De Kleer ◽  
Alexander Feldman

Machine learning based diagnosis engines require large data sets for training. When experimental data is insufficient, system models can be used to supplement the data. Such models are typically simplified and imprecise, hence with some degree of uncertainty. In this paper we show how to deal with uncertainty in synthetic training data. The data is produced using a model with uncertainties. The uncertainties originate from inaccurate parameter values or parameters that take different values based on the mode of operation. We demonstrate how techniques from the uncertainty quantification field can be used to reduce the numerical complexity of the training algorithm. In particular, we use generalized polynomial chaos to efficiently approximate the loss function. In addition, we present a neural network architecture specifically designed to deal with uncertainties in the training data. As an illustrative example, we show how our approach can be used to detect faults in an elevator system.

