Counting the Hidden Defects in Software Documents

Handbook of Research on Machine Learning Applications and Trends ◽

10.4018/978-1-60566-766-9.ch025 ◽

2010 ◽

pp. 519-538

Author(s):

Frank Padberg

Keyword(s):

Neural Networks ◽

Training Data ◽

Estimation Methods ◽

Model Parameters ◽

Small Data ◽

Data Sets ◽

Software Inspection ◽

Bayesian Techniques ◽

Standard Quality ◽

Small Data Sets

The author uses neural networks to estimate how many defects are hidden in a software document. The inputs to the models are metrics collected during a software inspection, a standard quality assurance technique applied to the document. Empirical data sets for inspections are typically small. The author identifies two key ingredients for successfully applying neural networks to small data sets: adapting the size, complexity, and input dimension of the networks to the amount of information available for training, and using Bayesian techniques instead of cross-validation to determine model parameters and select the final model. For inspections, the machine learning approach is highly successful: on the standard benchmark it outperforms the previously existing defect estimation methods in software engineering by a factor of 4 in accuracy. The author's approach is also applicable in other contexts that are subject to small training data sets.
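
The first ingredient named in the abstract, matching network size to the amount of training data, can be illustrated by counting the free parameters of a small network. The sketch below is not the author's procedure: the single-hidden-layer architecture, the function names, and the "no more parameters than training samples" budget are all assumptions made for illustration.

```python
def mlp_parameter_count(n_inputs: int, n_hidden: int, n_outputs: int = 1) -> int:
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden_layer = (n_inputs + 1) * n_hidden   # +1 for each hidden unit's bias
    output_layer = (n_hidden + 1) * n_outputs  # +1 for each output unit's bias
    return hidden_layer + output_layer


def largest_hidden_layer(n_inputs: int, n_samples: int, n_outputs: int = 1) -> int:
    """Largest hidden width whose parameter count does not exceed n_samples.

    The "parameters <= samples" budget is an illustrative assumption,
    not a rule taken from the chapter.
    """
    per_hidden_unit = n_inputs + 1 + n_outputs
    return max(0, (n_samples - n_outputs) // per_hidden_unit)


if __name__ == "__main__":
    # Hypothetical numbers: 3 inspection metrics as inputs, 20 past inspections.
    width = largest_hidden_layer(n_inputs=3, n_samples=20)
    print(width, mlp_parameter_count(3, width))
```

Under this budget, adding one more input metric raises the cost of every hidden unit, which is one way to see why the input dimension has to shrink along with the data set.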

Neural Networks - Their Use and Abuse for Small Data Sets

Heuristic and Optimization for Knowledge Discovery ◽

10.4018/978-1-930708-26-6.ch015 ◽

2011 ◽

pp. 169-185

Author(s):

Denny Meyer ◽

Andrew Balemi ◽

Chris Wearing

Keyword(s):

Neural Networks ◽

Research Data ◽

Diagnostic Tools ◽

Small Data ◽

Data Sets ◽

Mathematical Form ◽

Statistical Tools ◽

P Values ◽

Model Interpretation ◽

Small Data Sets

Neural networks are commonly used for prediction and classification when data sets are large. They have a big advantage over conventional statistical tools in that it is not necessary to assume any mathematical form for the functional relationship between the variables. However, they also have a few associated problems, chief of which are probably the risk of over-parametrization in the absence of P-values, the lack of appropriate diagnostic tools and the difficulties associated with model interpretation. These problems are particularly pertinent in the case of small data sets. This chapter investigates these problems from a statistical perspective in the context of typical market research data.

The Application of Artificial Neural Networks With Small Data Sets: An Example for Analysis of Fracture Spacing in the Lisburne Formation, Northeastern Alaska

SPE Reservoir Evaluation & Engineering ◽

10.2118/103188-pa ◽

2008 ◽

Vol 11 (03) ◽

pp. 598-605 ◽

Cited By ~ 10

Author(s):

Danial Kaviani ◽

Thang Bui ◽

Jerry L. Jensen ◽

Catherine Hanks

Keyword(s):

Neural Networks ◽

Artificial Neural Networks ◽

Petroleum Engineering ◽

Small Data ◽

Data Sets ◽

Classification Problems ◽

Error Performance ◽

Data Set ◽

Fracture Spacing ◽

Small Data Sets

Summary

Artificial neural networks (ANNs) have been used widely for prediction and classification problems. In particular, many methods for building ANNs have appeared in the last 2 decades. One of the continuing important limitations of using ANNs, however, is their poor ability to analyze small data sets because of overfitting. Several methods have been proposed in the literature to overcome this problem. On the basis of our study, we can conclude that ANNs that use radial basis functions (RBFs) can decrease the error of the prediction effectively when there is an underlying relationship between the variables. We have applied this and other methods to determine the factors controlling and related to fracture spacing in the Lisburne formation, northeastern Alaska. By comparing the RBF results with those from other ANN methods, we find that the former method gives a substantially smaller error than many of the alternative methods. For example, the errors in predicted fracture spacing for the Lisburne formation with conventional ANN methods are approximately 50 to 200% larger than those obtained with RBFs. With a method that predicts fracture spacing more accurately, we were able to identify more reliably the effects on the spacing of such factors as bed thickness, lithology, structural position, and degree of folding. By comparing performances of all the methods we tested, we observed that some methods that performed well in one test did not necessarily do as well in another test. This suggests that, while RBF can be expected to be among the best methods, there is no "best universal method" for all the cases, and testing different methods for each case is required. Nonetheless, through this study, we were able to identify several candidate methods and, thereby, narrow the work required to find a suitable ANN.

In petroleum engineering and geosciences, the number of data is limited in many cases because of expense or logistical limitations (e.g., limited core, poor borehole conditions, or restricted logging suites). Thus, the methods used in this study should be attractive in many petroleum-engineering contexts in which complex, nonlinear relationships need to be modeled by use of small data sets.

Introduction

An ANN is "an information-processing system that has certain performance characteristics in common with biological neural networks" (Fausett 1994). On the basis of the "universal approximation theorem" with a sufficient number of hidden nodes, multilayer neural networks (Fig. 1) are able to predict any unknown function (Haykin 1999). ANNs are widely used in prediction and classification problems and have numerous applications in geosciences and petroleum engineering, including permeability prediction (Aminian et al. 2003), fluid-properties prediction (Sultan and Al-Kaabi 2002), and well-test-data analysis (Osman and Al-Marhoun 2005). Given a basic network structure, there is a wide variety of ANNs that can be produced. For example, different methods or criteria used to train the network produce ANNs that provide different predictions (e.g., the early-stopping and weight-decay methods). Also, two or more neural networks can be combined to produce an ANN with better error performance or other qualities, giving the so-called "ensemble learning methods," a term that covers a large variety of methods, including stacked generalization and ensemble averaging.

An additional problem is introduced when the data sets are small. This is a common situation in petroleum-engineering and geosciences applications, in which the cost of data or collection logistics may limit the number of measurements. In such instances, the use of ANNs can result in overfitting, where the model is fitted to the training data points but performs poorly for prediction of other points (Fig. 2).

In this study, we try to identify, among myriad possibilities, a few ANNs that provide good error performance with limited sample numbers. After a brief review of various types of ANNs, we use a synthetic data set to discuss, apply, and compare the methods that have been proposed in the literature to overcome the small-data-sets problem. Finally, we apply these methods to an actual data set, fracture-spacing data from the Lisburne Group, northeastern Alaska, and evaluate the results.
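
As a rough illustration of the kind of RBF model the summary refers to, the sketch below fits a Gaussian RBF network to a handful of one-dimensional points. It is not the configuration used in the paper: placing one basis function on every training point, the fixed `width`, the small `ridge` penalty (a weight-decay-style regulariser), and all function names are assumptions chosen to keep the example self-contained.

```python
import math


def gaussian(r2: float, width: float) -> float:
    """Gaussian basis function evaluated at squared distance r2."""
    return math.exp(-r2 / (2.0 * width ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_rbf(xs, ys, width=1.0, ridge=1e-3):
    """Return a predictor built from one Gaussian basis per training point.

    Weights solve (Phi + ridge * I) w = y; a larger ridge gives a smoother fit.
    """
    phi = [[gaussian((xi - xj) ** 2, width) for xj in xs] for xi in xs]
    for i in range(len(xs)):
        phi[i][i] += ridge
    weights = solve(phi, ys)

    def predict(x):
        return sum(w * gaussian((x - c) ** 2, width) for w, c in zip(weights, xs))

    return predict


if __name__ == "__main__":
    # Hypothetical small data set: five samples of a smooth function.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [math.sin(x) for x in xs]
    model = fit_rbf(xs, ys)
    print(model(2.5))
```

With so few points, `width` and `ridge` control how strongly the model is smoothed, so in practice they would need to be tuned for each data set, in line with the summary's remark that no single method is best in all cases.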

Comparison of inference methods for estimating semivariogram model parameters and their uncertainty: The case of small data sets

Computers & Geosciences ◽

10.1016/j.cageo.2012.06.002 ◽

2013 ◽

Vol 50 ◽

pp. 154-164 ◽

Cited By ~ 13

Author(s):

Eulogio Pardo-Igúzquiza ◽

Peter A. Dowd

Keyword(s):

Model Parameters ◽

Small Data ◽

Data Sets ◽

Semivariogram Model ◽

Small Data Sets ◽

Inference Methods

Training Data Extraction and Object Detection in Surveillance Scenario

Sensors ◽

10.3390/s20092689 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2689 ◽

Cited By ~ 1

Author(s):

Artur Wilkowski ◽

Maciej Stefańczyk ◽

Włodzimierz Kasprzak

Keyword(s):

Public Space ◽

Data Extraction ◽

Criminal Activity ◽

Training Data ◽

Support Vector ◽

Small Data ◽

Data Sets ◽

Surveillance Systems ◽

Automatic Data ◽

Small Data Sets

Police and various security services use video analysis to secure public spaces and mass events and to investigate criminal activity. Because of the huge amount of data supplied to surveillance systems, some automatic data processing is a necessity. In one typical scenario, an operator marks an object in an image frame and searches for all occurrences of the object in other frames or even image sequences. This problem is hard in general. Algorithms supporting this scenario must reconcile several seemingly contradictory requirements: training and detection speed, detection reliability, and learning from small data sets. In the system proposed here, we use a two-stage detector. The first stage, region proposal, is based on a Cascade Classifier, while the second stage, classification, is based on either Support Vector Machines (SVMs) or Convolutional Neural Networks (CNNs). The proposed configuration ensures both speed and detection reliability. In addition, an object tracking and background-foreground separation algorithm, supported by the GrabCut algorithm and a sample synthesis procedure, is used to collect rich training data for the detector. Experiments show that the system is effective, useful, and applicable to practical surveillance tasks.

The Application of Artificial Neural Networks With Small Data Sets: An Example for Analysis of Fracture Spacing in the Lisburne Formation, Northeastern Alaska

10.2118/103188-ms ◽

2006 ◽

Author(s):

Danial Kaviani ◽

Thang Bui ◽

Jerry L. Jensen ◽

Catherine Hanks

Keyword(s):

Neural Networks ◽

Artificial Neural Networks ◽

Small Data ◽

Data Sets ◽

Fracture Spacing ◽

Small Data Sets ◽

Artificial Neural

Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets

Statistics in Medicine ◽

10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0 ◽

2000 ◽

Vol 19 (8) ◽

pp. 1059-1079 ◽

Cited By ~ 405

Author(s):

Ewout W. Steyerberg ◽

Marinus J.C. Eijkemans ◽

Frank E. Harrell ◽

J. Dik F. Habbema

Keyword(s):

Logistic Regression ◽

Regression Analysis ◽

Logistic Regression Analysis ◽

Estimation Methods ◽

Small Data ◽

Data Sets ◽

Small Data Sets ◽

Prognostic Modelling

Classification of jujube defects in small data sets based on transfer learning

Neural Computing and Applications ◽

10.1007/s00521-021-05715-2 ◽

2021 ◽

Author(s):

Jianping Ju ◽

Hong Zheng ◽

Xiaohang Xu ◽

Zhongyuan Guo ◽

Zhaohui Zheng ◽

...

Keyword(s):

Transfer Learning ◽

Loss Function ◽

Training Model ◽

Parameter Distribution ◽

Test Accuracy ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Small Data Sets

Although convolutional neural networks have achieved success in the field of image classification, there are still challenges in agricultural product quality sorting, such as machine vision-based detection of jujube defects. The performance of jujube defect detection mainly depends on the feature extraction and the classifier used. Because of the diversity of the jujube materials and the variability of the testing environment, the traditional method of manually extracting features often fails to meet the requirements of practical application. In this paper, a jujube sorting model for small data sets, based on a convolutional neural network and transfer learning, is proposed to meet the actual demand of jujube defect detection. First, the original images collected from an actual jujube sorting production line were pre-processed, and the data were augmented to establish a data set of five categories of jujube defects. The original CNN model is then improved by embedding the SE module and by replacing the softmax loss function with the triplet loss function and the center loss function. Finally, a deep model pre-trained on the ImageNet image data set was trained further on the jujube defects data set, so that the parameters of the pre-trained model could fit the parameter distribution of the jujube defect images, and the parameter distribution was transferred to the jujube defects data set to complete the transfer of the model and realize the detection and classification of the jujube defects. The classification results are visualized with heatmaps, and classification accuracy and confusion matrices are analyzed against the comparison models. The experimental results show that the SE-ResNet50-CL model optimizes the fine-grained classification problem of jujube defect recognition, and the test accuracy reaches 94.15%. The model has good stability and high recognition accuracy in complex environments.
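
The abstract names the triplet loss and the center loss without giving formulas. The sketch below shows one common textbook form of each on plain Python lists. The Euclidean (non-squared) distance in the triplet term, the default `margin`, the averaging over the batch in the center term, and all function names are assumptions; the paper's exact formulation and weighting may differ.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss asking the positive to be closer than the negative by `margin`."""
    d_pos = math.sqrt(squared_distance(anchor, positive))
    d_neg = math.sqrt(squared_distance(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


def center_loss(features, labels, centers):
    """Half the mean squared distance from each feature vector to its class centre."""
    total = sum(squared_distance(f, centers[y]) for f, y in zip(features, labels))
    return 0.5 * total / len(features)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings for two defect classes, labelled 0 and 1.
    centers = {0: [0.0, 0.0], 1: [4.0, 4.0]}
    features = [[0.5, 0.0], [4.0, 3.0]]
    print(triplet_loss(features[0], centers[0], centers[1]))
    print(center_loss(features, [0, 1], centers))
```

The triplet term separates classes from one another, while the center term tightens each class around its own centre, which is the usual motivation for combining the two in fine-grained classification.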

45 A permutation test for validation of genomic estimated breeding values

Journal of Animal Science ◽

10.1093/jas/skaa278.016 ◽

2020 ◽

Vol 98 (Supplement_4) ◽

pp. 8-9

Author(s):

Zahra Karimi ◽

Brian Sullivan ◽

Mohsen Jafarikia

Keyword(s):

Permutation Test ◽

Breeding Value ◽

Small Data ◽

Data Sets ◽

Type I ◽

Future Performance ◽

Small Data Sets ◽

Estimated Breeding Value ◽

Estimated Breeding Values ◽

Top 40

Previous studies have shown that the accuracy of Genomic Estimated Breeding Value (GEBV) as a predictor of future performance is higher than the traditional Estimated Breeding Value (EBV). The purpose of this study was to estimate the potential advantage of selection on GEBV for litter size (LS) compared to selection on EBV in the Canadian swine dam line breeds. The study included 236 Landrace and 210 Yorkshire gilts born in 2017 which had their first farrowing after 2017. GEBV and EBV for LS were calculated with data that was available at the end of 2017 (GEBV2017 and EBV2017, respectively). De-regressed EBV for LS in July 2019 (dEBV2019) was used as an adjusted phenotype. The average dEBV2019 for the top 40% of sows based on GEBV2017 was compared to the average dEBV2019 for the top 40% of sows based on EBV2017. The standard error of the estimated difference for each breed was estimated by comparing the average dEBV2019 for repeated random samples of two sets of 40% of the gilts. In comparison to the top 40% ranked based on EBV2017, ranking based on GEBV2017 resulted in an extra 0.45 (±0.29) and 0.37 (±0.25) piglets born per litter in Landrace and Yorkshire replacement gilts, respectively. The estimated Type I errors of the GEBV2017 gain over EBV2017 were 6% and 7% in Landrace and Yorkshire, respectively. Considering selection of both replacement boars and replacement gilts using GEBV instead of EBV can translate into increased annual genetic gain of 0.3 extra piglets per litter, which would more than double the rate of gain observed from typical EBV based selection. The permutation test for validation used in this study appears effective with relatively small data sets and could be applied to other traits, other species and other prediction methods.
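
The resampling procedure described in the abstract can be sketched as follows. This is an interpretation, not the authors' code: the function names, the number of resampling rounds, drawing the two random 40% sets independently (so they may overlap), and reading the reported Type I error as a one-sided tail probability are all assumptions.

```python
import random
from statistics import mean, stdev


def top_fraction_mean(scores, outcome, fraction=0.4):
    """Mean outcome of the animals ranked in the top `fraction` by `scores`."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return mean(outcome[i] for i in ranked[:k])


def permutation_check(gebv, ebv, outcome, fraction=0.4, n_rounds=2000, seed=0):
    """Compare top-fraction selection on two rankings against random selection.

    Returns (observed difference, standard error, one-sided tail probability).
    """
    rng = random.Random(seed)
    n = len(outcome)
    k = max(1, round(n * fraction))
    observed = (top_fraction_mean(gebv, outcome, fraction)
                - top_fraction_mean(ebv, outcome, fraction))
    null = []
    for _ in range(n_rounds):
        # Two random sets of the same size, drawn independently (assumption).
        set_a = rng.sample(range(n), k)
        set_b = rng.sample(range(n), k)
        null.append(mean(outcome[i] for i in set_a)
                    - mean(outcome[i] for i in set_b))
    standard_error = stdev(null)
    tail = sum(d >= observed for d in null) / n_rounds
    return observed, standard_error, tail


if __name__ == "__main__":
    # Synthetic stand-in data; none of these values come from the study.
    rng = random.Random(1)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(200)]
    gebv = [y + rng.gauss(0.0, 1.0) for y in outcome]  # more informative score
    ebv = [y + rng.gauss(0.0, 2.0) for y in outcome]   # noisier score
    print(permutation_check(gebv, ebv, outcome))
```

Because the null distribution is built only from the observed outcomes, the check needs no distributional assumptions, which fits the abstract's point that it remains usable with relatively small data sets.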

A Comparison Study of Mahalanobis-Taguchi System and Neural Network for Multivariate Pattern Recognition

Design Engineering, Parts A and B ◽

10.1115/imece2005-80029 ◽

2005 ◽

Cited By ~ 10

Author(s):

Jungeui Hong ◽

Elizabeth A. Cudney ◽

Genichi Taguchi ◽

Rajesh Jugulum ◽

Kioumars Paryani ◽

...

Keyword(s):

Neural Network ◽

Small Data ◽

Data Sets ◽

Comparison Study ◽

Data Set ◽

Set Size ◽

Breast Cancer Study ◽

Discriminant Ability ◽

Small Data Sets ◽

Multivariate Pattern

The Mahalanobis-Taguchi System is a diagnosis and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural network to discriminate using small data sets. We examine the discriminant ability as a function of data set size using an application area where reliable data is publicly available. The study uses the Wisconsin Breast Cancer study with nine attributes and one class.

Ensemble CNN in Transform Domains for Image Super-resolution from Small Data Sets

2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA) ◽

10.1109/icmla51294.2020.00068 ◽

2020 ◽

Author(s):

Yingnan Liu ◽

Randy Clinton Paffenroth

Keyword(s):

Super Resolution ◽

Small Data ◽

Data Sets ◽

Small Data Sets ◽

Image Super Resolution ◽

Transform Domains
