Performance tuning for machine learning-based software development effort prediction models

Software development effort estimation is a critical activity in project management. In this study, machine learning algorithms were investigated in conjunction with feature transformation, feature selection, and parameter tuning techniques to estimate development effort accurately, and a new model was proposed as part of an expert system. We preferred the most general-purpose algorithms and applied a parameter optimization technique (GridSearch), feature transformation techniques (binning and one-hot encoding), and a feature selection algorithm (principal component analysis). All models were trained on the ISBSG datasets and implemented with the scikit-learn package in Python. The proposed model uses a multilayer perceptron as its underlying algorithm, bins continuous features and one-hot encodes categorical data into numerical values, performs feature selection based on principal component analysis, and tunes parameters with the GridSearch algorithm. We demonstrate that our effort prediction model mostly outperforms the other existing models in prediction accuracy, as measured by the mean absolute residual.
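The pipeline described in this abstract can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the column layout, the helper names `make_synthetic_projects` and `build_search`, and the parameter grid are all assumptions, and mean absolute error is used as a stand-in for the mean absolute residual.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder


def make_synthetic_projects(n=120, seed=0):
    """Synthetic stand-in for project data (NOT the ISBSG dataset)."""
    rng = np.random.default_rng(seed)
    size = rng.uniform(50, 1000, n)    # hypothetical continuous feature
    team = rng.uniform(2, 20, n)       # hypothetical continuous feature
    platform = rng.integers(0, 3, n)   # hypothetical categorical code
    X = np.column_stack([size, team, platform]).astype(float)
    y = 0.01 * size + 0.2 * team + platform + rng.normal(0, 0.5, n)
    return X, y


def build_search():
    """Binning + one-hot encoding -> PCA -> MLP, tuned by grid search."""
    preprocess = ColumnTransformer(
        [
            # Bin the continuous columns (0 and 1) into one-hot bin indicators.
            ("bins", KBinsDiscretizer(n_bins=4, encode="onehot-dense",
                                      strategy="quantile"), [0, 1]),
            # One-hot encode the categorical column (2).
            ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),
        ],
        sparse_threshold=0.0,  # always return a dense array for PCA
    )
    pipeline = Pipeline([
        ("prep", preprocess),
        ("pca", PCA()),
        ("mlp", MLPRegressor(max_iter=500, random_state=0)),
    ])
    # Assumed, deliberately tiny grid; the paper's grid is not given here.
    grid = {
        "pca__n_components": [3, 5],
        "mlp__hidden_layer_sizes": [(16,), (32,)],
    }
    return GridSearchCV(pipeline, grid,
                        scoring="neg_mean_absolute_error", cv=3)


if __name__ == "__main__":
    X, y = make_synthetic_projects()
    search = build_search().fit(X, y)
    print(search.best_params_, -search.best_score_)
```

Wrapping the preprocessing, PCA, and regressor in one `Pipeline` means every cross-validation fold refits the binning and PCA on its own training split, which avoids leaking information from the validation split into the tuned parameters.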

Multivariate Analysis and Machine Learning for Ripeness Classification of Cape Gooseberry Fruits

Processes ◽ 10.3390/pr7120928 ◽ 2019 ◽ Vol 7 (12) ◽ pp. 928 ◽ Cited By ~ 2

Author(s): Miguel De-la-Torre ◽ Omar Zatarain ◽ Himer Avila-George ◽ Mirna Muñoz ◽ Jimy Oblitas ◽ ...

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Feature Selection ◽ Principal Component ◽ Component Analysis ◽ Support Vector ◽ Color Spaces ◽ Combination Methods ◽ Fruit Samples ◽ Cape Gooseberry

This paper explores five multivariate techniques for information fusion in sorting Cape gooseberry fruits by visual ripeness: principal component analysis, linear discriminant analysis, independent component analysis, eigenvector centrality feature selection, and multi-cluster feature selection. These techniques are applied to the concatenated channels of the red, green, blue (RGB); hue, saturation, value (HSV); and lightness, red/green value, blue/yellow value (L*a*b*) color spaces (9 features in total). Machine learning techniques have been reported for sorting Cape gooseberry fruits by ripeness. Classifiers such as neural networks, support vector machines, and nearest neighbors discriminate between fruit samples using different color spaces. Although the color spaces are equivalent up to a transformation, a few classifiers perform better in some of them because of differences in the pixel distribution of the samples. Experimental results show that both selection and combination of color channels allow classifiers to reach similar levels of accuracy; however, combination methods still incur a higher computational cost. The highest accuracy was obtained using the seven-dimensional principal component analysis feature space.
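As an illustration of the nine-feature representation mentioned above, the sketch below concatenates the RGB, HSV, and L*a*b* channels of a single pixel. It assumes sRGB input in the range 0 to 1 and a D65 reference white, neither of which is stated in the abstract; `srgb_to_lab` and `nine_channel_features` are hypothetical helper names.

```python
import colorsys

# D65 reference white (assumption; the abstract does not name an illuminant).
WHITE = (0.95047, 1.0, 1.08883)


def _linearize(c):
    """Undo the sRGB gamma curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """Piecewise cube-root function from the CIE L*a*b* definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (channels in [0, 1]) to CIE L*a*b*."""
    rl, gl, bl = _linearize(r), _linearize(g), _linearize(b)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx, fy, fz = _f(x / WHITE[0]), _f(y / WHITE[1]), _f(z / WHITE[2])
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def nine_channel_features(r, g, b):
    """Concatenate RGB, HSV, and L*a*b* channels into one 9-value vector."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    lightness, a_star, b_star = srgb_to_lab(r, g, b)
    return [r, g, b, h, s, v, lightness, a_star, b_star]


if __name__ == "__main__":
    print(nine_channel_features(1.0, 0.0, 0.0))  # a pure red pixel
```

A dimensionality reduction step such as principal component analysis would then be fitted on a matrix of these vectors, one row per pixel or per fruit sample.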

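Combination of Fisher Score Feature Selection and Principal Component Analysis (PCA) for Cervix Dysplasia Classification (original title: Kombinasi Feature Selection Fisher Score dan Principal Component Analysis (PCA) untuk Klasifikasi Cervix Dysplasia)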

Jurnal Teknologi Informasi dan Ilmu Komputer ◽ 10.25126/jtiik.2020702987 ◽ 2020 ◽ Vol 7 (3) ◽ pp. 565

Author(s): Krisan Aprian Widagdo ◽ Kusworo Adi ◽ Rahmat Gernowo

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Feature Selection ◽ Cross Validation ◽ Pap Smear ◽ Principal Component ◽ Component Analysis ◽ Fisher Score ◽ Fold Cross Validation ◽ Cervix Dysplasia

Examining Pap smear images is an important step in the early diagnosis of cervix dysplasia, but it requires a lot of resources. Machine learning can address this problem; however, its accuracy depends on the features used. Only relevant and discriminative features yield accurate classification results. This work combines Fisher Score (FScore) feature selection with Principal Component Analysis (PCA). First, FScore selects relevant features based on their ranking scores. PCA then transforms the candidate features into a new, uncorrelated dataset. A backpropagation artificial neural network is used to evaluate the performance of the FScore and PCA combination. The model is evaluated with 5-fold cross-validation. The combination is also compared with a model using the original features and a model using the FScore-selected features. Experimental results show that the combination of FScore and PCA produced the best performance (accuracy 0.964±0.006, sensitivity 0.990±0.005, and specificity 0.889±0.009). In terms of computation time, the combination is relatively fast. This work shows that applying feature selection and feature extraction can improve classification performance within a relatively short time.
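The ranking step can be illustrated with one common definition of the Fisher score: for each feature, the class-size-weighted squared distance of the class means from the overall mean, divided by the class-size-weighted within-class variance. The sketch below assumes that definition, which the abstract does not spell out, and uses made-up toy data; `fisher_scores` and `top_k_features` are hypothetical names.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column of X, given labels y."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    scores = []
    for j in range(len(X[0])):
        overall_mean = fmean([row[j] for row in X])
        between = 0.0  # spread of class means around the overall mean
        within = 0.0   # spread of samples around their own class mean
        for rows in by_class.values():
            column = [row[j] for row in rows]
            class_mean = fmean(column)
            between += len(column) * (class_mean - overall_mean) ** 2
            within += len(column) * pvariance(column, class_mean)
        if within > 0:
            scores.append(between / within)
        else:
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 does not.
    X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.9], [3.2, 3.1]]
    y = [0, 0, 1, 1]
    scores = fisher_scores(X, y)
    print(scores, top_k_features(scores, 1))
```

In the workflow the abstract describes, the columns kept by this ranking would then be passed to PCA to obtain uncorrelated components before training the classifier.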

Classification of Observations through Combination of the Dimension Reduction and the Cluster Analysis

International Journal of Advanced Research in Computer Science and Software Engineering ◽ 10.23956/ijarcsse.v7i8.13 ◽ 2017 ◽ Vol 7 (8) ◽ pp. 30

Author(s): Hyeuk Kim

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Cluster Analysis ◽ Unsupervised Learning ◽ Principal Component ◽ Component Analysis ◽ Baseball Players ◽ Partitioning Around Medoids ◽ Different Characteristics

Unsupervised learning in machine learning divides data into several groups. Observations in the same group have similar characteristics, and observations in different groups have different characteristics. In this paper, we classify data by partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and draw a graph using two components as the axes. We interpret the meaning of the clustering graphically through this procedure. The combination of partitioning around medoids and principal component analysis can be applied to any other data, and the approach makes it easy to figure out the characteristics of the groups.
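A simplified sketch of partitioning around medoids is given below. It keeps only the swap phase of the algorithm and starts from the first k points rather than using the usual BUILD initialisation, so it is an assumption-laden illustration rather than the procedure used in the paper; the toy points are made up.

```python
from itertools import product
from math import dist


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Swap-based k-medoids; returns (medoid indices, cluster labels)."""
    medoids = list(range(k))  # naive start (assumption, not PAM's BUILD step)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid point.
        for slot, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
            cost = total_cost(points, trial)
            if cost < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, cost, True
    labels = [
        min(range(k), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(pam(pts, 2))
```

Because every medoid is an actual observation, each cluster can be summarised by a real data point (for instance, a representative player), which is one of the practical differences from k-means, where centroids are averages.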

Analysis of the Bath Motion in the MM-SQC Dynamics Using Unsupervised Machine Learning Dimensionality Reduction Approaches: Principal Component Analysis

10.26434/chemrxiv.13332530 ◽ 2020

Author(s): Jiawei Peng ◽ Yu Xie ◽ Deping Hu ◽ Zhenggang Lan

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Collective Motion ◽ Principal Component ◽ Component Analysis ◽ Nonadiabatic Dynamics ◽ Trajectory Data ◽ Unsupervised Machine Learning ◽ Physical Knowledge ◽ Vibronic Couplings

The system-plus-bath model is an important tool for understanding nonadiabatic dynamics in large molecular systems. Understanding the collective motion of a huge number of bath modes is essential to reveal their key roles in the overall dynamics. We apply principal component analysis (PCA) to investigate the bath motion, based on the massive data generated from MM-SQC (the symmetrical quasi-classical dynamics method based on the Meyer-Miller mapping Hamiltonian) nonadiabatic dynamics simulations of excited-state energy transfer in a Frenkel-exciton model. The PCA method clearly shows that two types of bath modes, those that display strong vibronic couplings and those whose frequencies are close to the electronic transition, are very important to the nonadiabatic dynamics. These observations are fully consistent with physical insight. This conclusion is obtained purely from the PCA of the trajectory data, without heavy reliance on pre-defined physical knowledge. The results show that the PCA approach, one of the simplest unsupervised machine learning methods, is very powerful for analyzing complicated nonadiabatic dynamics in the condensed phase involving many degrees of freedom.

Feature Selection for Classification using Principal Component Analysis and Information Gain

Expert Systems with Applications ◽ 10.1016/j.eswa.2021.114765 ◽ 2021 ◽ Vol 174 ◽ pp. 114765 ◽ Cited By ~ 1

Author(s): Erick Odhiambo Omuya ◽ George Onyango Okeyo ◽ Michael Waema Kimwele

Keyword(s): Principal Component Analysis ◽ Feature Selection ◽ Information Gain ◽ Principal Component ◽ Component Analysis ◽ Selection For

Modified Principal Component Analysis (MPCA) for feature selection of hyperspectral imagery

IGARSS 2003. 2003 IEEE International Geoscience and Remote Sensing Symposium. Proceedings (IEEE Cat. No.03CH37477) ◽ 10.1109/igarss.2003.1295268 ◽ 2004

Author(s): Cheng Wang ◽ M. Menenti ◽ Zhao-Liang Li

Keyword(s): Principal Component Analysis ◽ Feature Selection ◽ Hyperspectral Imagery ◽ Principal Component ◽ Component Analysis ◽ Selection Of

Machine learning approaches to estimating software development effort

IEEE Transactions on Software Engineering ◽ 10.1109/32.345828 ◽ 1995 ◽ Vol 21 (2) ◽ pp. 126-137 ◽ Cited By ~ 259

Author(s): K. Srinivasan ◽ D. Fisher

Keyword(s): Machine Learning ◽ Software Development ◽ Development Effort ◽ Learning Approaches ◽ Software Development Effort

APPLYING EXPERT JUDGMENT TO IMPROVE AN INDIVIDUAL'S ABILITY TO PREDICT SOFTWARE DEVELOPMENT EFFORT

International Journal of Software Engineering and Knowledge Engineering ◽ 10.1142/s0218194012500118 ◽ 2012 ◽ Vol 22 (04) ◽ pp. 467-483 ◽ Cited By ~ 9

Author(s): CUAUHTÉMOC LÓPEZ-MARTÍN ◽ ALAIN ABRAN

Keyword(s): Software Development ◽ Software Process ◽ Development Effort ◽ Academic Environment ◽ Academic Institutions ◽ Graduate Courses ◽ Software Projects ◽ Personal Software Process ◽ Software Development Effort ◽ Effort Prediction

Expert-based effort prediction in software projects can be taught, beginning with the practices learned in an academic environment in courses designed to encourage them. However, the length of such courses is a major concern for both industry and academia. Industry has to work without its employees while they are taking such a course, and academic institutions find it hard to fit the course into an already tight schedule. In this research, the set of Personal Software Process (PSP) practices is reordered and the practices are distributed among fewer assignments, in an attempt to address these concerns. This study involved 148 practitioners taking graduate courses who developed 1,036 software course assignments. The hypothesis on which it is based is the following: When the activities in the original PSP set are reordered into fewer assignments, the result is expert-based effort prediction that is statistically significantly better.
