AN EMPIRICAL STUDY OF FEATURE RANKING TECHNIQUES FOR SOFTWARE QUALITY PREDICTION

Author(s):  
TAGHI M. KHOSHGOFTAAR ◽  
KEHAN GAO ◽  
AMRI NAPOLITANO

The primary goal of software quality engineering is to produce a high-quality software product through the use of specific techniques and processes. One strategy is to apply data mining techniques to software metric and defect data collected during the software development process to identify potential low-quality program modules. In this paper, we investigate the use of feature selection in the context of software quality estimation (also referred to as software defect prediction), where a classification model is used to predict whether program modules (instances) are fault-prone or not fault-prone. Seven filter-based feature ranking techniques are examined. Six of them are commonly used, while the seventh, signal-to-noise ratio (SNR), is rarely employed. The objective of the paper is to compare these seven techniques across various software data sets and to assess their effectiveness for software quality modeling. A case study is performed on 16 software data sets, and classification models are built with five different learners and evaluated with two performance metrics. Our experimental results are summarized based on statistical tests for significance. The main conclusion is that the SNR technique performs as well as the best of the six commonly used techniques.
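As a minimal sketch of how such a filter-based ranker might look, the snippet below scores each feature of a binary-labeled data set with one common SNR definition, |μ₊ − μ₋| / (σ₊ + σ₋). The data layout and this particular formula are assumptions for illustration; the paper does not fix either, and several SNR variants exist in the literature.

```python
from statistics import mean, stdev

def snr_ranking(X, y):
    """Rank features of a binary-labeled data set by signal-to-noise ratio.

    X is a list of feature rows, y a list of 0/1 labels (1 = fault-prone).
    Uses SNR_j = |mean_pos_j - mean_neg_j| / (std_pos_j + std_neg_j),
    one common definition (an assumption here). Returns feature indices
    sorted from most to least relevant."""
    n_features = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(n_features):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        denom = stdev(p) + stdev(n) or 1e-12  # guard against zero variance
        scores.append(abs(mean(p) - mean(n)) / denom)
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

A filter-based ranker like this is learner-independent: the top-ranked features can then be fed to any of the five classifiers used in the study.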

Author(s):  
Yi Liu ◽  
Taghi M. Khoshgoftaar

A software quality estimation model is an important tool for a given software quality assurance initiative. Software quality classification models can be used to indicate which program modules are fault-prone (FP) and not fault-prone (NFP). Such models assume that enough resources are available for quality improvement of all the modules predicted as FP. In conjunction with a software quality classification model, a quality-based ranking of program modules has practical benefits, since priority can be given to modules that are more FP. However, such a ranking cannot be achieved by traditional classification techniques. We present a novel software quality classification model based on multi-objective optimization with genetic programming (GP). More specifically, the GP-based model provides both a classification (FP or NFP) and a quality-based ranking for the program modules. The quality factor used to rank the modules is typically the number of faults or defects associated with a module. Genetic programming is ideally suited to optimizing multiple criteria simultaneously. In our study, three performance criteria are used to evolve a GP-based software quality model: classification performance, module ranking, and size of the GP tree. The third criterion addresses a commonly observed phenomenon in GP, namely bloat. The proposed model is investigated with case studies of software measurement data obtained from two industrial software systems.
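One standard way to evolve candidates against several criteria at once is Pareto dominance: an individual survives selection unless some other individual is at least as good on every objective and strictly better on one. The sketch below shows only that comparison, with the three objectives named after the paper's criteria; the tuple layout and the sign convention for tree size are illustrative assumptions, not the paper's exact fitness definitions.

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates candidate b.

    Each candidate is a tuple (classification_score, ranking_score, -tree_size);
    higher is better for every component (tree size is negated so that smaller
    GP trees, i.e. less bloat, score higher)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
```

In a multi-objective GP loop, the non-dominated set under this comparison forms the Pareto front from which the next generation's parents are drawn.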


2016 ◽  
Vol 25 (01) ◽  
pp. 1550028 ◽  
Author(s):  
Mete Celik ◽  
Fehim Koylu ◽  
Dervis Karaboga

In data mining, classification rule learning extracts knowledge represented as IF-THEN rules, which are comprehensible and readable. It is a challenging problem due to the complexity of data sets. Various meta-heuristic machine learning algorithms have been proposed for rule learning. Cooperative rule learning is the process of discovering all classification rules concurrently in a single run. In this paper, a novel cooperative rule learning algorithm based on the Artificial Bee Colony, called CoABCMiner, is introduced. The proposed algorithm handles the training data set and discovers a classification model containing the rule list. Token competition, a new updating strategy used in the onlooker and employed phases, and a new scout bee mechanism are proposed in CoABCMiner to achieve cooperative learning of different rules belonging to different classes. We compared the results of CoABCMiner with several state-of-the-art algorithms using 14 benchmark data sets. Nonparametric statistical tests, such as the Friedman test, post hoc tests, and contrast estimation based on medians, are performed; such tests assess how the control algorithm compares with the other algorithms across multiple problems. A sensitivity analysis of CoABCMiner is also conducted. It is concluded that CoABCMiner can efficiently discover classification rules for the data sets used in the experiments.
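The Friedman test mentioned above ranks the competing algorithms on each data set and checks whether the average ranks differ more than chance would allow. A small self-contained sketch of the statistic (the input layout is an assumption; lower values are taken to mean better performance):

```python
def friedman_statistic(results):
    """Friedman chi-square statistic for k algorithms on N data sets.

    results[i][j] is the error of algorithm j on data set i (lower = better).
    Returns (statistic, average ranks); tied values receive average ranks."""
    N, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:                      # walk blocks of tied values
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1         # average 1-based rank for the block
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    avg_ranks = [s / N for s in rank_sums]
    chi2 = 12 * N / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, avg_ranks
```

A significant statistic justifies the post hoc pairwise comparisons against the control algorithm that the paper reports.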


2020 ◽  
pp. 1-14
Author(s):  
Esraa Hassan ◽  
Noha A. Hikal ◽  
Samir Elmuogy

Nowadays, Coronavirus disease (COVID-19) is considered one of the most critical pandemics on Earth, owing to its ability to spread rapidly among humans as well as animals. COVID-19 is expected to spread around the world: roughly 70% of the world's population might become infected in the coming years. Therefore, an accurate and efficient diagnostic tool is highly required, which is the main objective of our study. Manual classification was traditionally used to detect different diseases, but it takes too much time and carries the probability of human error. Automatic image classification reduces doctors' diagnostic time and could thereby save lives. We propose an automatic classification architecture based on a deep neural network, called the Worried Deep Neural Network (WDNN) model, with transfer learning. Comparative analysis reveals that the proposed WDNN model outperforms three pre-trained models, InceptionV3, ResNet50, and VGG19, in terms of various performance metrics. Due to the shortage of COVID-19 data, data augmentation was used to increase the number of images in the positive class, and normalization was applied to give all images the same size. Experimentation was done on a COVID-19 data set collected from different cases, with 2,623 images in total (1,573 training, 524 validation, 524 test). Our proposed model achieved 99.046%, 98.684%, 99.119%, and 98.90% in terms of accuracy, precision, recall, and F-score, respectively. The results are compared with both traditional machine learning methods and those using convolutional neural networks (CNNs), and they demonstrate the ability of our classification model to serve as an alternative to the current diagnostic tools.
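The preprocessing the abstract describes, augmenting the scarce positive class and resizing every image to a common input size, can be sketched without any deep learning framework. The two helpers below (horizontal flip and nearest-neighbour resize on plain nested lists) are illustrative assumptions about the pipeline; the actual WDNN training would run on top of a framework such as TensorFlow or PyTorch.

```python
def augment_flip(image):
    """Horizontal-flip augmentation: one cheap way to grow a small class."""
    return [row[::-1] for row in image]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize so every image shares one network input size."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]
```

In a transfer-learning setup, the resized images would then be fed to a pre-trained backbone (e.g. InceptionV3) whose top layers are retrained on the COVID-19 classes.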


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. The method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. Experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
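A linear weighting of pre-trained word vectors into a document vector, the first step the abstract describes, can be sketched as below. The uniform mean and an optional per-token weight (e.g. a TF-IDF score) stand in for the paper's two weighting methods, whose exact definitions the abstract does not give.

```python
def doc_vector(tokens, embeddings, weights=None):
    """Linearly combine pre-trained word vectors into one document vector.

    embeddings maps token -> vector (list of floats); weights optionally maps
    token -> scalar (e.g. a TF-IDF weight). Out-of-vocabulary tokens are
    skipped; with weights=None this is the plain mean of the word vectors."""
    dim = len(next(iter(embeddings.values())))
    acc, total = [0.0] * dim, 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        acc = [a + w * v for a, v in zip(acc, embeddings[tok])]
        total += w
    return [a / total for a in acc] if total else acc
```

The resulting fixed-length vector is what gets fed to the downstream sentiment classifier, through which emotional polarity is propagated back into the embeddings.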


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Wanzeng Kong ◽  
Jinshuai Yu ◽  
Ying Cheng ◽  
Weihua Cong ◽  
Huanhuan Xue

In 3D multibeam sonar imaging, serious image-noise interference makes object detection based only on manual operation inefficient, and manual operation is also not conducive to data storage and maintenance. In this paper, a set of automatic sonar image detection techniques based on 3D imaging is developed to satisfy the practical requirements of sonar image detection. First, preprocessing is conducted to alleviate the noise, and the approximate position of each object is obtained by calculating the signal-to-noise ratio of each target. Second, the separation of the water body and the strata is realized by maximizing the between-class variance (Otsu's method), since there are obvious differences between these two areas; image segmentation can then easily be applied to both. Finally, feature extraction is carried out, and a multidimensional Bayesian classification model is established to perform classification. Experimental results show that the sonar image detection technology can effectively detect targets and meet the requirements of practical applications.
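Otsu's method, used above to separate water from strata, picks the gray-level threshold that maximizes the between-class variance of the two resulting regions. A minimal sketch on a flat list of 8-bit pixel values (the pixel representation is an assumption; real sonar images would be 2D arrays):

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level maximising between-class variance (Otsu's method)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = w_b = 0
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_b += hist[t]                    # background pixel count
        if w_b == 0:
            continue
        w_f = total - w_b                 # foreground pixel count
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mu_b = sum_b / w_b                # background mean
        mu_f = (sum_all - sum_b) / w_f    # foreground mean
        var_between = w_b * w_f * (mu_b - mu_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Pixels at or below the returned level fall into one class (e.g. the water body) and the rest into the other, after which each region can be segmented independently.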


Author(s):  
Ned Augenblick ◽  
Matthew Rabin

Abstract When a Bayesian learns new information and changes her beliefs, she must on average become concomitantly more certain about the state of the world. Consequently, it is rare for a Bayesian to frequently shift beliefs substantially while remaining relatively uncertain, or, conversely, become very confident with relatively little belief movement. We formalize this intuition by developing specific measures of movement and uncertainty reduction given a Bayesian’s changing beliefs over time, showing that these measures are equal in expectation and creating consequent statistical tests for Bayesianess. We then show connections between these two core concepts and four common psychological biases, suggesting that the test might be particularly good at detecting these biases. We provide support for this conclusion by simulating the performance of our test and other martingale tests. Finally, we apply our test to data sets of individual, algorithmic, and market beliefs.
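The equality-in-expectation claim can be sketched for a binary state with belief \(\pi_t\) at time \(t\). Because Bayesian beliefs form a martingale, cross terms cancel and squared movement telescopes (the notation, with movement measured as summed squared belief changes, is an illustrative assumption consistent with the abstract's description):

```latex
\mathbb{E}\!\left[\sum_{t=1}^{T} (\pi_t - \pi_{t-1})^2\right]
  = \sum_{t=1}^{T}\left(\mathbb{E}[\pi_t^2] - \mathbb{E}[\pi_{t-1}^2]\right)
  = \mathbb{E}[\pi_T^2] - \pi_0^2
  = \pi_0(1-\pi_0) - \mathbb{E}\!\left[\pi_T(1-\pi_T)\right],
```

where the first step uses \(\mathbb{E}[\pi_t \pi_{t-1}] = \mathbb{E}[\pi_{t-1}^2]\) (the martingale property) and the last uses \(\mathbb{E}[\pi_T] = \pi_0\). The right-hand side is exactly the expected reduction in uncertainty \(\pi(1-\pi)\), so a stream of beliefs that moves much more (or much less) than its uncertainty falls provides statistical evidence against Bayesian updating.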


2016 ◽  
Vol 16 (24) ◽  
pp. 15545-15559 ◽  
Author(s):  
Ernesto Reyes-Villegas ◽  
David C. Green ◽  
Max Priestman ◽  
Francesco Canonaco ◽  
Hugh Coe ◽  
...  

Abstract. The multilinear engine (ME-2) factorization tool is being widely used following the recent development of the Source Finder (SoFi) interface at the Paul Scherrer Institute. However, the success of this tool, when using the a-value approach, largely depends on the inputs (i.e. target profiles) applied as well as on the experience of the user. A strategy to explore the solution space is proposed, in which the solution that best describes the organic aerosol (OA) sources is determined according to the systematic application of predefined statistical tests. This includes trilinear regression, which proves to be a useful tool for comparing different ME-2 solutions. Aerosol Chemical Speciation Monitor (ACSM) measurements were carried out at the urban background site of North Kensington, London, from March to December 2013, where for the first time the behaviour of OA sources and their possible environmental implications were studied using an ACSM. Five OA sources were identified: biomass burning OA (BBOA), hydrocarbon-like OA (HOA), cooking OA (COA), semivolatile oxygenated OA (SVOOA) and low-volatility oxygenated OA (LVOOA). ME-2 analysis of the seasonal data sets (spring, summer and autumn) showed a higher variability in the OA sources that was not detected in the combined March–December data set; this variability was explored with the triangle plots f44:f43 and f44:f60, in which a high variation of SVOOA relative to LVOOA was observed in the f44:f43 analysis. Hence, it was possible to conclude that, when performing source apportionment on long-term measurements, important information may be lost, and the analysis should also be applied over shorter periods of time, such as seasons. Further analysis of the atmospheric implications of these OA sources was carried out, identifying evidence of a possible contribution of heavy-duty diesel vehicles to air pollution during weekdays, compared to petrol-fuelled vehicles.


2021 ◽  
Author(s):  
Luz Karime Atencia ◽  
María Gómez del Campo ◽  
Gema Camacho ◽  
Antonio Hueso ◽  
Ana M. Tarquis

Olive is the main fruit tree in Spain, representing 50% of the fruit-tree surface, around 2,751,255 ha. Due to its adaptation to arid conditions and the scarcity of water, a regulated deficit irrigation (RDI) strategy is normally applied in traditional olive orchards and, more recently, in high-density orchards; RDI is one of the most important techniques used in olive hedgerow orchards. An investigation of the detection of water stress in nonhomogeneous olive tree canopies, such as orchards, using remote sensing imagery is presented.

In the 2018 and 2019 seasons, stem water potential data were collected to characterize tree water status in a hedgerow olive orchard (cv. Arbequina) located in Chozas de Canales (Toledo). Close to the measurement dates, remote sensing images from spectral and thermal sensors were acquired. Several vegetation indices (VIs), using one or both sensor types, were estimated from selected areas corresponding to the olive crowns, avoiding canopy shadows.

Nonparametric statistical tests between the VIs and stem water potential were carried out to reveal the most significant correlations. The results will be discussed in terms of robustness and sensitivity between both data sets at different olive phenological states.

ACKNOWLEDGEMENTS

Financial support provided by the Spanish Research Agency, co-financed with European Union FEDER funds (AEI/FEDER, UE, AGL2016-77282-C3-2R project), and by the Comunidad de Madrid through calls for grants for the completion of Industrial Doctorates, is greatly appreciated.
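A typical nonparametric test for this kind of VI-versus-water-potential comparison is the Spearman rank correlation, which measures monotonic association without assuming normality. A minimal tie-free sketch (the abstract does not name the specific test, so Spearman is an assumption):

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length samples,
    e.g. a vegetation index and stem water potential (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank               # 1-based rank of each observation
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Values near +1 or −1 indicate a strong monotonic relationship between the index and tree water status; values near 0 indicate none.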


2018 ◽  
Vol 8 (12) ◽  
pp. 2421 ◽  
Author(s):  
Chongya Song ◽  
Alexander Pons ◽  
Kang Yen

In the field of network intrusion, malware usually evades anomaly detection by disguising malicious behavior as legitimate access. Therefore, detecting these attacks from network traffic has become a challenge in this adversarial setting. In this paper, an enhanced Hidden Markov Model, called the Anti-Adversarial Hidden Markov Model (AA-HMM), is proposed to effectively detect evasion patterns, using the Dynamic Window and Threshold techniques to achieve adaptive, anti-adversarial, and online-learning abilities. In addition, a concept called Pattern Entropy is defined and acts as the foundation of AA-HMM. We evaluate the effectiveness of our approach on two well-known benchmark data sets, NSL-KDD and CTU-13, in terms of the common performance metrics and the algorithm's adaptation and anti-adversary abilities.
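The abstract does not define Pattern Entropy, but an entropy signal over a sliding observation window is a plausible building block for such a detector: a sudden change in the entropy of recent traffic symbols can flag a shift in behavioral pattern. The sketch below is a hedged illustration of that general idea, not the paper's definition.

```python
from collections import Counter
from math import log2

def window_entropy(symbols, window):
    """Shannon entropy (bits) of the last `window` observation symbols.

    A flat entropy trace suggests stable traffic behavior; abrupt jumps or
    drops can signal a pattern change worth inspecting."""
    recent = symbols[-window:]
    counts = Counter(recent)
    n = len(recent)
    return -sum(c / n * log2(c / n) for c in counts.values())
```

In an online setting the window length itself could be adapted, which is roughly the role the Dynamic Window technique plays in AA-HMM.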


2009 ◽  
Vol 9 (23) ◽  
pp. 9101-9110 ◽  
Author(s):  
V. Grewe ◽  
R. Sausen

Abstract. This comment focuses on the statistical limitations of a model grading, as applied by D. Waugh and V. Eyring (2008) (WE08). The grade g is calculated for a specific diagnostic and basically relates the difference between the means of model and observational data to the standard deviation of the observational data set. We performed Monte Carlo simulations, which show that this method has the potential to lead to large 95% confidence intervals for the grade. Moreover, the difference between two model grades often has to be very large to become statistically significant. Since the confidence intervals were not considered in detail for all diagnostics, the grading in WE08 cannot be interpreted without further analysis. The results of the statistical tests performed in WE08 agree with our findings. However, most of those tests are based on special cases, which implicitly assume that observations are available without any errors and that the interannual variability of the observational data and the model data are equal. Without these assumptions, the 95% confidence intervals become even larger. Hence, the case in which we assumed perfect observations (ignored errors) provides a good estimate of an upper bound for the threshold below which a grade becomes statistically significant. Examples have shown that the 95% confidence interval may even span the whole grading interval [0, 1]. Without considering confidence intervals, the grades presented in WE08 do not allow one to decide whether a model result deviates significantly from reality. Neither WE08 nor this comment identifies which of the grades presented in WE08 exhibit such a significant deviation. However, our analysis of the grading method demonstrates the unacceptably high potential for these grades to be insignificant, which implies that the grades given by WE08 cannot be interpreted by the reader.
We further show that the inclusion of confidence intervals in the grading approach is necessary, since otherwise even a perfect model may receive a low grade.
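A Monte Carlo estimate of a grade's confidence interval can be sketched as follows. The grade form g = max(0, 1 − |μ_model − μ_obs| / (3 σ_obs)) is an assumption based on the comment's description of WE08 (a mean difference scaled by the observational standard deviation); each simulation redraws a finite record of model and observation values to mimic sampling variability.

```python
import random

def grade_ci(mu_m, sd_m, mu_o, sd_o, n_years=20, n_sim=10000, seed=0):
    """Monte Carlo 95% confidence interval for an assumed grade of the form
    g = max(0, 1 - |mean_model - mean_obs| / (3 * sd_obs)).

    Each simulation redraws n_years of Gaussian model and observation
    values, recomputes the grade, and the 2.5th/97.5th percentiles of the
    resulting grade distribution form the interval."""
    rng = random.Random(seed)
    mean = lambda v: sum(v) / len(v)
    sd = lambda v: (sum((x - mean(v)) ** 2 for x in v) / (len(v) - 1)) ** 0.5
    grades = []
    for _ in range(n_sim):
        m = [rng.gauss(mu_m, sd_m) for _ in range(n_years)]
        o = [rng.gauss(mu_o, sd_o) for _ in range(n_years)]
        grades.append(max(0.0, 1 - abs(mean(m) - mean(o)) / (3 * sd(o))))
    grades.sort()
    return grades[int(0.025 * n_sim)], grades[int(0.975 * n_sim)]
```

Even for a perfect model (identical means and variances), a short record yields a spread of grades below 1, which is the comment's point: a grade without its confidence interval is uninterpretable.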

