On the need of preserving order of data when validating within-project defect classifiers

2020 ◽  
Vol 25 (6) ◽  
pp. 4805-4830
Author(s):  
Davide Falessi ◽  
Jacky Huang ◽  
Likhita Narayana ◽  
Jennifer Fong Thai ◽  
Burak Turhan

Abstract We put ourselves in the shoes of a practitioner who uses data from previous project releases to predict which classes of the current release are defect-prone. In this scenario, the practitioner would like to use the most accurate classifier among the many available ones. A validation technique, hereinafter "technique", defines how to measure the prediction accuracy of a classifier. Several previous studies have analyzed and compared such techniques. However, no previous study compared validation techniques in the within-project, across-release, class-level context or considered techniques that preserve the order of data. In this paper, we investigate which technique recommends the most accurate classifier. We use the last release of a project as the ground truth to evaluate the classifier's accuracy and hence the ability of a technique to recommend an accurate classifier. We consider nine classifiers, two industry and 13 open-source projects, and three validation techniques: 10-fold cross-validation (the most used technique), bootstrap (the recommended technique), and walk-forward (a technique that preserves the order of data). Our results show that: 1) classifiers differ in accuracy in all datasets regardless of their entity per value, 2) walk-forward statistically outperforms both 10-fold cross-validation and bootstrap on all three accuracy metrics (AUC of the selected classifier, bias, and absolute bias), 3) surprisingly, all techniques turned out to be more prone to overestimating than to underestimating the performance of classifiers, and 4) the defect rate changed between the first and second half of the data in both industry projects and in 83% of the open-source datasets. Given these empirical results, and given that walk-forward is by nature simpler, less expensive, and more stable than the other two techniques, this study recommends the use of techniques that preserve the order of data, such as walk-forward, over 10-fold cross-validation and bootstrap in the within-project, across-release, class-level context.
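
The walk-forward technique can be illustrated with a minimal sketch: each release serves as a test set for a classifier trained only on the releases that precede it, so the order of the data is preserved. The DataFrame layout, column names, and choice of random forest below are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of walk-forward validation for within-project,
# across-release defect prediction (column names are hypothetical).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def walk_forward_auc(df, feature_cols, label_col="defective", release_col="release"):
    releases = sorted(df[release_col].unique())
    aucs = []
    # Train on all releases before release r, test on r itself,
    # so the temporal order of the data is never violated.
    for i in range(1, len(releases)):
        train = df[df[release_col].isin(releases[:i])]
        test = df[df[release_col] == releases[i]]
        if test[label_col].nunique() < 2:
            continue  # AUC is undefined when the test release has one class
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train[feature_cols], train[label_col])
        scores = clf.predict_proba(test[feature_cols])[:, 1]
        aucs.append(roc_auc_score(test[label_col], scores))
    return sum(aucs) / len(aucs)
```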

2021 ◽  
Vol 17 (2) ◽  
pp. e1008767
Author(s):  
Zutan Li ◽  
Hangjin Jiang ◽  
Lingpeng Kong ◽  
Yuanyuan Chen ◽  
Kun Lang ◽  
...  

N6-methyladenine (6mA) is an important DNA modification form associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA's biological functions. However, the existing experimental techniques for detecting 6mA sites are cost-ineffective, which implies a great need for new computational methods for this problem. In this paper, we developed a deep learning framework named Deep6mA to identify DNA 6mA sites without requiring any prior knowledge of 6mA or manually crafted sequence features, and its performance is superior to that of other DNA 6mA prediction tools. Specifically, 5-fold cross-validation on a benchmark rice dataset gives Deep6mA a sensitivity of 92.96% and a specificity of 95.06%, with an overall prediction accuracy of 94%. Importantly, we find that sequences with 6mA sites share similar patterns across different species: the model trained on rice data predicts the 6mA sites of three other species, Arabidopsis thaliana, Fragaria vesca and Rosa chinensis, with a prediction accuracy of over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which suggests the sequence near a 6mA site may be conserved; and (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulation of downstream gene expression.
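
The 5-fold evaluation protocol reported above can be sketched as follows; `build_model` is a hypothetical stand-in for the Deep6mA network, whose architecture is not described in this abstract.

```python
# Minimal sketch of 5-fold cross-validation reporting sensitivity and
# specificity for a binary 6mA-site classifier (model is a placeholder).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

def cross_validate(X, y, build_model, n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    sens, spec = [], []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], y_pred).ravel()
        sens.append(tp / (tp + fn))  # sensitivity: recall on 6mA sites
        spec.append(tn / (tn + fp))  # specificity: recall on non-6mA sites
    return np.mean(sens), np.mean(spec)
```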


Mekatronika ◽  
2021 ◽  
Vol 3 (1) ◽  
pp. 27-31
Author(s):  
Ken-ji Ee ◽  
Ahmad Fakhri Bin Ab. Nasir ◽  
Anwar P. P. Abdul Majeed ◽  
Mohd Azraai Mohd Razman ◽  
Nur Hafieza Ismail

The animal classification system is a technology for classifying animal classes (types) automatically and is useful in many applications. Many types of learning models have recently been applied to this technology. Nonetheless, it is worth noting that extracting and classifying the animal features is non-trivial, particularly in the deep learning approach to a successful animal classification system. Transfer Learning (TL) has been demonstrated to be a powerful tool for extracting essential features. However, the employment of such a method in animal classification applications is somewhat limited. The present study aims to determine a suitable TL-conventional classifier pipeline for animal classification. VGG16 and VGG19 were used to extract features, which were then coupled with either a k-Nearest Neighbour (k-NN) or a Support Vector Machine (SVM) classifier. Prior to that, a total of 4000 images were gathered, consisting of five classes: cows, goats, buffalos, dogs, and cats. The data were split in an 80:20 ratio for training and testing. The classifier hyperparameters were tuned by a grid search approach using five-fold cross-validation. The study demonstrated that the best TL pipeline identified is VGG16 along with an optimised SVM, as it yielded an average classification accuracy of 0.975. The findings of the present investigation could facilitate animal classification applications, e.g., monitoring animals in the wild.
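
A minimal sketch of the TL-conventional classifier pipeline is given below: VGG16 with ImageNet weights extracts features, which are then classified by an SVM tuned via grid search with five-fold cross-validation. The hyperparameter grid and preprocessing are illustrative assumptions.

```python
# Minimal sketch of a VGG16 feature extractor feeding a grid-searched SVM.
import numpy as np
from tensorflow.keras.applications import VGG16
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def extract_features(images):
    # images: array of shape (n, 224, 224, 3), already preprocessed
    backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")
    return backbone.predict(images)

def tune_svm(features, labels):
    grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # assumed grid
    search = GridSearchCV(SVC(), grid, cv=5)  # five-fold cross-validation
    search.fit(features, labels)
    return search.best_estimator_, search.best_score_
```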


Author(s):  
Nathan Swanson ◽  
Donald Koban ◽  
Patrick Brundage

Abstract Applying Google's PageRank model to sports is a popular concept in contemporary sports ranking. However, there is limited evidence that rankings generated with PageRank models do well at predicting the winners of playoff series. In this paper, we use a PageRank model to predict the outcomes of the 2008–2016 NHL playoffs. Unlike previous studies that use a uniform personalization vector, we incorporate Corsi statistics into the personalization vector, use nine-fold cross-validation to identify tuning parameters, and evaluate the prediction accuracy of the tuned model. We found our ratings had 70% accuracy in predicting the outcome of playoff series, outperforming the Colley, Massey, Bradley-Terry, Maher, and Generalized Markov models by 5%. The implication of our results is that fitting parameter values and adding a personalization vector can lead to improved performance when using PageRank models.
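
The idea of weighting PageRank with a Corsi-based personalization vector can be sketched as below; the win graph, team names, Corsi values, and damping factor are illustrative toy assumptions, not the paper's tuned model.

```python
# Minimal sketch of personalized PageRank ratings over a directed win graph.
import networkx as nx

games = [("BOS", "TOR"), ("TOR", "MTL"), ("BOS", "MTL")]  # (winner, loser), toy data
corsi = {"BOS": 0.52, "TOR": 0.50, "MTL": 0.48}           # toy Corsi-for shares

G = nx.DiGraph()
for winner, loser in games:
    G.add_edge(loser, winner)  # rating credit flows from loser to winner

total = sum(corsi.values())
personalization = {team: value / total for team, value in corsi.items()}

ratings = nx.pagerank(G, alpha=0.85, personalization=personalization)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # best-rated team first
```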


2020 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Irkham Widhi Saputro ◽  
Bety Wulan Sari

Universitas AMIKOM Yogyakarta is a university that enrols thousands of new students, especially in the Informatics study program. In 2012 there were 1009 new students, and in 2013 there were 859 new students. Unfortunately, of these many students only around 50% graduate on time. These data were used to build a classification system using data mining techniques with the Naïve Bayes method. The dataset consists of 300 records sourced from the alumni data of the 2012 and 2013 cohorts, with 150 records from each; of these, 144 students graduated on time and 156 did not. Testing was carried out using 10-Fold Cross Validation and a Confusion Matrix. The results show that, on average, the Naïve Bayes model achieves an accuracy of 68%, a precision of 61.3%, a recall of 65.3%, and an F1-score of 61%. The performance of the model can be influenced by the dataset used to build it. Keywords — data mining, classification, Naïve Bayes, graduation time
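
The evaluation described above, Gaussian Naïve Bayes scored with 10-fold cross-validation and a confusion matrix, can be sketched as follows; the feature matrix X and on-time/late labels y are assumed to come from the alumni records.

```python
# Minimal sketch of 10-fold cross-validated Naïve Bayes evaluation.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

def evaluate_naive_bayes(X, y):
    model = GaussianNB()
    # One out-of-fold prediction per student, summarized in a single
    # confusion matrix and an accuracy/precision/recall/F1 report.
    y_pred = cross_val_predict(model, X, y, cv=10)
    print(confusion_matrix(y, y_pred))
    print(classification_report(y, y_pred))
```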


2020 ◽  
Vol 10 (7) ◽  
pp. 2265-2273 ◽  
Author(s):  
Ahmad H. Sallam ◽  
Emily Conley ◽  
Dzianis Prakapenka ◽  
Yang Da ◽  
James A. Anderson

The use of haplotypes may improve the accuracy of genomic prediction over single SNPs because haplotypes can better capture linkage disequilibrium and genomic similarity in different lines and may capture local high-order allelic interactions. Additionally, prediction accuracy could be improved by portraying population structure in the calibration set. A set of 383 advanced lines and cultivars that represent the diversity of the University of Minnesota wheat breeding program was phenotyped for yield, test weight, and protein content and genotyped using the Illumina 90K SNP Assay. Population structure was confirmed using single SNPs. Haplotype blocks of 5, 10, 15, and 20 adjacent markers were constructed for all chromosomes. A multi-allelic haplotype prediction algorithm was implemented and compared with single SNPs using both k-fold cross-validation and stratified sampling optimization. After confirming population structure, the stratified sampling improved the predictive ability compared with k-fold cross-validation for yield and protein content, but reduced the predictive ability for test weight. In all cases, haplotype predictions outperformed single SNPs. Haplotypes of 15 adjacent markers showed the best improvement in accuracy for all traits; however, this was more pronounced in yield and protein content. The combined use of haplotypes of 15 adjacent markers and training population optimization significantly improved the predictive ability for yield and protein content by 14.3% (four percentage points) and 16.8% (seven percentage points), respectively, compared with using single SNPs and k-fold cross-validation. These results emphasize the effectiveness of using haplotypes in genomic selection to increase genetic gain in self-fertilized crops.
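
Forming haplotype blocks from adjacent markers can be sketched as below: each window of k SNPs is collapsed into one multi-allelic code per line. The genotype encoding and block construction are simplified assumptions, not the paper's algorithm.

```python
# Minimal sketch: collapse windows of adjacent SNPs into haplotype-block codes.
import numpy as np

def haplotype_blocks(genotypes, block_size=15):
    # genotypes: (n_lines, n_markers) array of phased SNP alleles (e.g. 0/1)
    n_lines, n_markers = genotypes.shape
    blocks = []
    for start in range(0, n_markers, block_size):
        window = genotypes[:, start:start + block_size]
        # join the SNP alleles in the window into one haplotype string,
        # then map each distinct haplotype to an integer code
        haplos = ["".join(map(str, row)) for row in window]
        codes = {h: i for i, h in enumerate(sorted(set(haplos)))}
        blocks.append(np.array([codes[h] for h in haplos]))
    return np.column_stack(blocks)  # (n_lines, n_blocks) multi-allelic matrix
```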


Author(s):  
Zhihao Ke ◽  
Xiaoning Liu ◽  
Yining Chen ◽  
Hongfu Shi ◽  
Zigang Deng

Abstract Owing to its self-stability and low energy consumption, high temperature superconducting (HTS) maglev has the potential to become a novel mode of transportation. As a key index guaranteeing the lateral self-stability of HTS maglev, the guiding force is strongly non-linear and determined by numerous factors, and these complexities impede further research. Compared with traditional finite element and polynomial fitting methods, deep learning algorithms could provide another approach to guiding force prediction, but this approach has yet to be verified. Therefore, this paper establishes five different neural network models (RBF, DNN, CNN, RNN, LSTM) to predict the HTS maglev guiding force and compares their prediction efficiency on 3720 pieces of collected data. Meanwhile, two adaptively iterative algorithms for adjusting the parameter matrix and learning rate are proposed, which effectively reduce computing time and unnecessary iterations. The results reveal that the DNN model shows the best goodness of fit, while the LSTM model displays the smoothest fitting curve for guiding force prediction. Based on this finding, the effects of learning rate and number of iterations on the prediction accuracy of the constructed DNN model are studied; the learning rate and number of iterations giving the highest guiding force prediction accuracy are 0.00025 and 90000, respectively. Moreover, the K-fold cross-validation method is applied to this DNN model, and the result demonstrates its generalization ability and robustness. The importance of K-fold cross-validation for ensuring the universality of the guiding force prediction model is likewise assessed. This paper is the first to combine HTS maglev guiding force prediction with deep learning algorithms while considering different field cooling heights, real-time magnetic flux density, liquid nitrogen temperature, and the motion direction of the bulk. Additionally, this paper provides a convenient and efficient method for HTS guiding force prediction and parameter optimization.
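
The K-fold validation of a DNN guiding force regressor can be sketched as follows; the layer sizes, epochs, and fold count are illustrative assumptions, while the 0.00025 learning rate follows the value reported above.

```python
# Minimal sketch: K-fold cross-validation of a DNN regressor for guiding force.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_dnn(n_features):
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),  # predicted guiding force
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.00025),
                  loss="mse")
    return model

def kfold_mse(X, y, n_splits=5):
    # X columns: field-cooling height, flux density, temperature, direction
    losses = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = build_dnn(X.shape[1])
        model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
        losses.append(model.evaluate(X[test_idx], y[test_idx], verbose=0))
    return float(np.mean(losses))
```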


2012 ◽  
Vol 542-543 ◽  
pp. 1438-1442
Author(s):  
Ting Hua Wang ◽  
Cai Yun Cai ◽  
Yan Liao

The kernel is a key component of support vector machines (SVMs) and other kernel methods. Based on the data distributions of the classes in the feature space, this paper proposes a model selection criterion to evaluate the goodness of a kernel in the multiclass classification scenario. This criterion is computationally efficient and is differentiable with respect to the kernel parameters. Compared with the k-fold cross-validation technique, which is often regarded as a benchmark, this criterion is found to yield about the same performance with much less computational overhead.
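
The abstract does not spell out the proposed criterion itself, so the sketch below shows only the k-fold cross-validation baseline it is compared against: selecting an RBF kernel width for an SVM by 5-fold cross-validated accuracy.

```python
# Minimal sketch of the k-fold baseline for kernel parameter selection.
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_gamma_by_cv(X, y, gammas=(0.01, 0.1, 1.0, 10.0), k=5):
    scores = {g: cross_val_score(SVC(kernel="rbf", gamma=g), X, y, cv=k).mean()
              for g in gammas}
    # the proposed criterion replaces this loop with a closed-form,
    # differentiable measure of class separation in the feature space
    return max(scores, key=scores.get), scores
```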


Author(s):  
Marcus O. Olatoye ◽  
Zhenbin Hu ◽  
Geoffrey P. Morris

Abstract Modifying plant architecture is often necessary for yield improvement and climate adaptation, but we lack understanding of the genotype-phenotype map for plant morphology in sorghum. Here, we use a nested association mapping (NAM) population that captures global allelic diversity of sorghum to characterize the genetics of leaf erectness, leaf width (at two stages), and stem diameter. Recombinant inbred lines (n = 2200) were phenotyped in multiple environments (35,200 observations) and joint linkage mapping was performed with ∼93,000 markers. Fifty-four QTL of small to large effect were identified for trait BLUPs (9–16 per trait), each explaining 0.4–4% of variation across the NAM population. While some of these QTL colocalize with sorghum homologs of grass genes [e.g. involved in hormone synthesis (maize spi1), floral transition (SbCN8), and transcriptional regulation of development (rice Ideal plant architecture1)], most QTL did not colocalize with an a priori candidate gene (82%). Genomic prediction accuracy was generally high in five-fold cross-validation (0.65–0.83), and varied from low to high in leave-one-family-out cross-validation (0.04–0.61). The findings provide a foundation to identify the molecular basis of architecture variation in sorghum and establish genomic-enabled breeding for improved plant architecture.
Core ideas
Understanding the genetics of plant architecture could facilitate the development of crop ideotypes for yield and adaptation.
The genetics of plant architecture traits was characterized in sorghum using multi-environment phenotyping in a global nested association mapping population.
Fifty-five quantitative trait loci were identified; some colocalize with homologs of known developmental regulators but most do not.
Genomic prediction accuracy was consistently high in five-fold cross-validation, but accuracy varied considerably in leave-one-family-out predictions.
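
Leave-one-family-out cross-validation for genomic prediction can be sketched as follows; the marker matrix, trait BLUPs, family labels, and the ridge model are illustrative assumptions rather than the study's actual pipeline.

```python
# Minimal sketch of leave-one-family-out genomic prediction accuracy.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_family_out_accuracy(X, y, family):
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=family):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        # accuracy: correlation of observed and predicted trait values
        # within the held-out NAM family
        accuracies.append(np.corrcoef(y[test_idx], pred)[0, 1])
    return float(np.mean(accuracies))
```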


Author(s):  
D. Mabuni ◽  
S. Aquter Babu

In machine learning, the data used are a more important factor than the logic of the program. With large and moderately sized datasets it is possible to obtain robust, high classification accuracies, but not with small and very small datasets. In particular, only large training datasets can reliably produce robust decision tree classification results. Classification results obtained from a single training and testing dataset pair are not reliable. The cross-validation technique instead uses many random folds of the same dataset for training and validation: to obtain reliable and statistically sound classification results, the same algorithm must be applied to different pairs of training and validation datasets. To overcome the problem of using only a single training dataset and a single testing dataset, the existing k-fold cross-validation technique uses a cross-validation plan to obtain improved decision tree classification accuracy. In this paper a new cross-validation technique called prime fold is proposed; it is thoroughly tested experimentally and verified using many benchmark UCI machine learning datasets. It is observed that the prime fold based decision tree classification accuracies obtained after experimentation are far better than those of existing techniques.
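
The abstract does not define the prime fold plan itself, so the sketch below shows only the conventional k-fold decision tree evaluation it is compared against, repeated over several fold counts to illustrate validating one algorithm on many train/validation splits.

```python
# Minimal sketch of the standard k-fold decision tree baseline.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def kfold_tree_accuracy(X, y, folds=(5, 7, 11)):
    results = {}
    for k in folds:
        scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=k)
        results[k] = scores.mean()  # mean accuracy across the k folds
    return results
```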

