Bug Severity Prediction using Keywords in Imbalanced Learning Environment

Author(s):  
Jayalath Ekanayake ◽  

Reported bugs in software systems are classified into severity levels before they are fixed. Bug reports are often not evenly distributed across these severity levels. However, most severity prediction models in the literature assume that the underlying data are evenly distributed, which does not always hold. The aim of this study is therefore to develop bug classification models from unevenly distributed datasets and to test them accordingly. To that end, the topics or keywords of the developer descriptions in bug reports are first extracted using the Rapid Automatic Keyword Extraction (RAKE) algorithm and then converted into numerical attributes, which, combined with the severity levels, form the datasets. These datasets are used to build classification models with the Naïve Bayes, Logistic Regression, and Decision Tree Learner algorithms. Because the models learn from highly skewed data, their prediction quality is measured using the Area Under the Receiver Operating Characteristic Curve (AUC). According to the results, the Logistic Regression model achieves 0.65 AUC, whereas the other two models reach at most 0.60 AUC. Although the datasets contain comparatively few instances of the high-severity classes, Blocking and High, the Logistic Regression models predict these two classes with an AUC of 0.65. Hence, this project shows that models can be trained on highly skewed datasets so that their prediction quality is comparable across all classes, regardless of the number of instances representing each class. Further, this project emphasizes that models trained in imbalanced learning environments should be evaluated using appropriate metrics. This work also shows that Logistic Regression can classify documents as capably as Naïve Bayes, which is well known for this task.
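The keyword-extraction step described above can be illustrated with a much-simplified RAKE sketch. This is not the author's implementation: the stopword list is a tiny illustrative assumption, and the binary "keyword present" encoding in `to_attributes` is only one plausible way to turn keywords into numerical attributes, since the abstract does not specify the encoding.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE uses a far larger one).
STOPWORDS = {"a", "an", "the", "is", "are", "when", "on", "in",
             "of", "to", "and", "with", "after", "it"}

def candidate_phrases(text):
    """Split text into candidate keyword phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of its words' degree/frequency ratios."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}

def to_attributes(text, vocabulary):
    """Binary attribute vector: 1 if a vocabulary keyword was extracted from text."""
    found = set(rake_scores(text))
    return [1 if keyword in found else 0 for keyword in vocabulary]

# Hypothetical bug description, used only to exercise the functions.
report = "Application crashes on startup. Null pointer exception in the login dialog."
scores = rake_scores(report)
attributes = to_attributes(report, ["null pointer exception", "memory leak"])
```

Longer phrases of rarely repeated words receive higher scores, which is why multi-word phrases such as "null pointer exception" tend to rank above single words.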

Author(s):  
Jayalath Bandara Ekanayake

Manual classification of bug reports is time-consuming because the reports are received in large quantities. As an alternative, this project proposes automatic prediction models to classify bug reports. The topics or candidate keywords are mined from the developer descriptions in bug reports using the RAKE algorithm and converted into attributes. These attributes, together with the target attribute (priority level), form the training datasets. Naïve Bayes, logistic regression, and decision tree learner algorithms are trained, and their prediction quality is measured using the area under the receiver operating characteristic curve (AUC), because AUC is not sensitive to bias in the datasets. The logistic regression model outperforms the other two models with 0.86 AUC, whereas the naïve Bayes and decision tree learner models recorded 0.79 AUC and 0.81 AUC, respectively. Bugs can thus be classified without developer involvement, and logistic regression is as strong a candidate as naïve Bayes for bug classification.
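The reason for preferring AUC on skewed data can be shown with a small sketch. It uses the rank-based definition of AUC for a binary problem; the studies above are multi-class, so treating one class against the rest is an assumption here, and the 95/5 class split is hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

# Hypothetical skewed data: 95 low-priority (0) and 5 high-priority (1) reports.
labels = [0] * 95 + [1] * 5

# A "model" that scores every report identically carries no information.
constant_scores = [0.1] * 100
majority_predictions = [0] * 100

acc = accuracy(labels, majority_predictions)  # looks good despite missing every positive
auc_constant = auc(labels, constant_scores)   # chance level

# A small ranking example with one misordered pair out of four.
auc_small = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Accuracy rewards the uninformative majority-class predictor, while AUC places it at chance level, which is the behaviour that makes AUC the safer metric when classes are imbalanced.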


2021 ◽  
Vol 22 (Supplement_1) ◽  
Author(s):  
T Heseltine ◽  
SW Murray ◽  
RL Jones ◽  
M Fisher ◽  
B Ruzsics

Abstract. Funding Acknowledgements: Type of funding sources: None. On behalf of: Liverpool Multiparametric Imaging Collaboration. Background: Coronary artery calcium (CAC) score is a well-established technique for stratifying an individual's cardiovascular disease (CVD) risk. Several well-established registries have incorporated CAC scoring into CVD risk prediction models to enhance accuracy. Hepatosteatosis (HS) has been shown to be an independent predictor of CVD events and can be measured on non-contrast computed tomography (CT). We sought to undertake a contemporary, comprehensive assessment of the influence of HS on CAC score alongside traditional CVD risk factors. In patients with HS it may be beneficial to offer routine CAC screening to evaluate CVD risk and so enhance opportunities for earlier primary prevention strategies. Methods: We performed a retrospective, observational analysis at a high-volume cardiac CT centre, analysing consecutive CT coronary angiography (CTCA) studies. All patients referred for investigation of chest pain over a 28-month period (June 2014 to November 2016) were included. Patients with established CVD were excluded. The cardiac findings were reported by a cardiologist and retrospectively analysed by two independent radiologists for the presence of HS. Those with CAC of zero and those with CAC greater than zero were compared for demographic and cardiac risks. A multivariate analysis comparing the risk factors was performed to adjust for the presence of established risk factors. A binomial logistic regression model was developed to assess the association between the presence of HS and increasing strata of CAC. Results: In total, 1499 patients without prior evidence of CVD were referred for CTCA. The assessment of HS was completed in 1195 (79.7%) and CAC score was performed in 1103 (92.3%). There were 466 with CVD and 637 without CVD. The prevalence of HS was significantly higher in those with CVD than in those without CVD on CTCA (51.3% versus 39.9%, p = 0.007). Male sex (50.7% versus 36.1%, p < 0.001), age (59.4 ± 13.7 versus 48.1 ± 13.6, p < 0.001) and diabetes (12.4% versus 6.9%, p = 0.04) were also significantly higher in the group with CAC greater than zero than in the group with a CAC score of zero. HS was associated with increasing strata of CAC score compared with CAC of zero (CAC score 1-100: OR 1.47, p = 0.01; CAC score 101-400: OR 1.68, p = 0.02; CAC score >400: OR 1.42, p = 0.14). This association became non-significant in the highest stratum of CAC score. Conclusion: We found significant associations of increasing age, male sex, diabetes and HS with the presence of CAC. HS was also associated with a more severe phenotype of CVD based on the multinomial logistic regression model. Although the association weakened for the highest stratum of CAC (CAC score >400), this likely reflects the overall low number of patients within this group and is likely a type II error. Based on these findings it may be appropriate to offer routine CVD risk stratification techniques to all those diagnosed with HS.
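As a worked illustration of how an odds ratio relates to the reported prevalences, the sketch below computes a crude (unadjusted) odds ratio from the two HS percentages quoted in the Results (51.3% versus 39.9%). This is only arithmetic on the published proportions; it is not the authors' adjusted, per-stratum logistic regression estimate, and the function names are illustrative.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)

def odds_ratio(p_group_a, p_group_b):
    """Crude odds ratio of an outcome in group A relative to group B."""
    return odds(p_group_a) / odds(p_group_b)

# HS prevalence in the two groups compared in the Results above.
crude_or = odds_ratio(0.513, 0.399)
print(f"crude odds ratio for HS: {crude_or:.2f}")
```

The crude value comes out near 1.6, the same order as the stratum-specific odds ratios in the abstract (1.42 to 1.68), which were estimated separately per CAC stratum with adjustment for other risk factors.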


2015 ◽  
Vol 54 (06) ◽  
pp. 560-567 ◽  
Author(s):  
K. Zhu ◽  
Z. Lou ◽  
J. Zhou ◽  
N. Ballester ◽  
P. Parikh ◽  
...  

Summary. Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Background: Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited predictive power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. Objectives: Explore the use of conditional logistic regression to increase the prediction accuracy. Methods: We analyzed an HCUP statewide in-patient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree to obtain influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression on each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. Results: The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model. Furthermore, the developed CLR models tend to achieve sensitivity more than 10% better than the standard classification models, which translates to correctly labeling an additional 400 to 500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location from discharge, the number of chronic conditions, and the number of acute procedures. Conclusions: It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise awareness of the need to collect data on additional markers and to develop the necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
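The two data-preparation steps named in the Methods, under-sampling and rule-based stratification, can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields, the 20/80 class split, and the threshold of four chronic conditions are all hypothetical and are not taken from the paper, and the per-stratum model fitting is left out.

```python
import random

def undersample(records, label_key="readmitted", seed=0):
    """Randomly drop records from larger classes until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

def stratify(records, rule):
    """Group records into strata using a decision rule that returns a stratum name."""
    strata = {}
    for record in records:
        strata.setdefault(rule(record), []).append(record)
    return strata

# Hypothetical discharge records: 20 readmitted, 80 not readmitted.
records = [
    {"id": i, "readmitted": i < 20, "chronic_conditions": i % 7}
    for i in range(100)
]

balanced = undersample(records)

# Hypothetical decision rule; a separate model would then be fitted per stratum.
strata = stratify(
    balanced,
    lambda r: "many_chronic" if r["chronic_conditions"] >= 4 else "few_chronic",
)
```

Each stratum returned by `stratify` would receive its own logistic regression model, which is the sense in which the abstract uses the term conditional logistic regression.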


2018 ◽  
Vol 12 (2) ◽  
pp. 119-126 ◽  
Author(s):  
Vikas Chaurasia ◽  
Saurabh Pal ◽  
BB Tiwari

Breast cancer is the second most common cancer occurring in women. Around 1.1 million cases were recorded in 2004. Observed rates of this cancer increase with industrialization and urbanization, and also with facilities for early detection. It remains much more common in high-income countries but is now increasing rapidly in middle- and low-income countries, including within Africa, much of Asia, and Latin America. Breast cancer is fatal in under half of all cases and is the leading cause of death from cancer in women, accounting for 16% of all cancer deaths worldwide. The objective of this research paper is to present a report on breast cancer in which we took advantage of available technological advancements to develop prediction models for breast cancer survivability. We used three popular data mining algorithms (Naïve Bayes, RBF Network, J48) to develop the prediction models using a large dataset (683 breast cancer cases). We also used 10-fold cross-validation to obtain an unbiased performance estimate of the three prediction models for comparison purposes. The results (based on average accuracy on the breast cancer dataset) indicated that Naïve Bayes is the best predictor with 97.36% accuracy on the holdout sample (this prediction accuracy is better than any reported in the literature), RBF Network came second with 96.77% accuracy, and J48 came third with 93.41% accuracy.
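The 10-fold cross-validation protocol mentioned above can be sketched as an index-splitting routine. This is a generic illustration, not the authors' tooling; the shuffling seed and function name are assumptions, and only the sample count (683) comes from the abstract.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % k) folds each receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

# 683 cases, as in the dataset described above.
folds = list(k_fold_indices(683, k=10))
fold_sizes = [len(test) for _, test in folds]
```

Each case appears in exactly one test fold, so averaging the per-fold accuracies gives an estimate in which every case is scored by a model that did not see it during training.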


2016 ◽  
Vol 97 ◽  
pp. 141-149 ◽  
Author(s):  
Hui Zhang ◽  
Zhi-Xing Cao ◽  
Meng Li ◽  
Yu-Zhi Li ◽  
Cheng Peng

2020 ◽  
Vol 1641 ◽  
pp. 012061
Author(s):  
Harsih Rianto ◽  
Amrin ◽  
Rudianto ◽  
Omar Pahlevi ◽  
Paramita Kusumawardhani ◽  
...  

2012 ◽  
Author(s):  
Theodore W. Cary ◽  
Alyssa Cwanger ◽  
Santosh S. Venkatesh ◽  
Emily F. Conant ◽  
Chandra M. Sehgal
