Predicting Aggregated User Satisfaction in Software Projects

Application of predictive modeling approaches is able solve the problem of the missing data. There are a lot of studies that investigate the effects of missing values on qualitative or quantitative modeling, but only few publications have been discussing it in case of applications to nanotechnology related data. Current project aimed at the development of multi-nano-read-across modeling technique that helps in predicting the toxicity of different species: bacteria, algae, protozoa, and mammalian cell lines. In this study, the experimental toxicity for 184 metal- and silica oxides (30 unique chemical types) nanoparticles from 15 experimental datasets was analyzed. A hybrid quantitative multi-nano-read-across approach that combines interspecies correlation analysis and self-organizing map analysis was developed. At the first step, hidden patterns of toxicity among the nanoparticles were identified using a combination of methods. Then the developed model that based on categorization of metal oxide nanoparticles’ toxicity outcomes was evaluated by means of combination of supervised and unsupervised machine learning techniques to find underlying factors responsible for toxicity.

Download Full-text

How toxicity of nanomaterials towards different species could be simultaneously evaluated: Novel multi-nano-read-across approach

10.26434/chemrxiv.5327677 ◽

2017 ◽

Author(s):

Natalia Sizochenko ◽

Alicja Mikolajczyk ◽

Karolina Jagiello ◽

Tomasz Puzyn ◽

Jerzy Leszczynski ◽

...

Keyword(s):

Missing Values ◽

Metal Oxide Nanoparticles ◽

Machine Learning Techniques ◽

Self Organizing Map ◽

Related Data ◽

Mammalian Cell Lines ◽

Learning Techniques ◽

Combination Of Methods ◽

Map Analysis ◽

Unique Chemical

Application of predictive modeling approaches is able solve the problem of the missing data. There are a lot of studies that investigate the effects of missing values on qualitative or quantitative modeling, but only few publications have been discussing it in case of applications to nanotechnology related data. Current project aimed at the development of multi-nano-read-across modeling technique that helps in predicting the toxicity of different species: bacteria, algae, protozoa, and mammalian cell lines. In this study, the experimental toxicity for 184 metal- and silica oxides (30 unique chemical types) nanoparticles from 15 experimental datasets was analyzed. A hybrid quantitative multi-nano-read-across approach that combines interspecies correlation analysis and self-organizing map analysis was developed. At the first step, hidden patterns of toxicity among the nanoparticles were identified using a combination of methods. Then the developed model that based on categorization of metal oxide nanoparticles’ toxicity outcomes was evaluated by means of combination of supervised and unsupervised machine learning techniques to find underlying factors responsible for toxicity.

Download Full-text

Learning from Imbalanced Educational Data Using Ensemble Machine Learning Algorithms

Webology ◽

10.14704/web/v18si01/web18053 ◽

2021 ◽

Vol 18 (Special Issue 01) ◽

pp. 183-195

Author(s):

Thingbaijam Lenin ◽

N. Chandrasekaran

Keyword(s):

Machine Learning ◽

Random Forest ◽

Missing Values ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Adaptive Boosting ◽

Stochastic Gradient Boosting ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Student’S Performance

Student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become a paramount importance for any institute to identify the student at risk of underperforming or failing or even drop out from the course. Machine Learning techniques may be used to develop a model for predicting student’s performance as early as at the time of admission. The task however is challenging as the educational data required to explore for modelling are usually imbalanced. We explore ensemble machine learning techniques namely bagging algorithm like random forest (rf) and boosting algorithms like adaptive boosting (adaboost), stochastic gradient boosting (gbm), extreme gradient boosting (xgbTree) in an attempt to develop a model for predicting the student’s performance of a private university at Meghalaya using three categories of data namely demographic, prior academic record, personality. The collected data are found to be highly imbalanced and also consists of missing values. We employ k-nearest neighbor (knn) data imputation technique to tackle the missing values. The models are developed on the imputed data with 10 fold cross validation technique and are evaluated using precision, specificity, recall, kappa metrics. As the data are imbalanced, we avoid using accuracy as the metrics of evaluating the model and instead use balanced accuracy and F-score. We compare the ensemble technique with single classifier C4.5. The best result is provided by random forest and adaboost with F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.

Download Full-text

Evolutionary Machine Learning for Classification with Incomplete Data

10.26686/wgtn.17072123 ◽

2021 ◽

Author(s):

◽

Cao Truong Tran

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Genetic Programming ◽

Incomplete Data ◽

Missing Values ◽

Machine Learning Techniques ◽

Feature Construction ◽

Classification Algorithms ◽

Learning Techniques ◽

Effectiveness And Efficiency

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be carefully handled because inadequate treatment of missing values will cause large classification errors. Existing most researchers working on classification with incomplete data focused on improving the effectiveness, but did not adequately address the issue of the efficiency of applying the classifiers to classify unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data which can be then used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially for the application process of classification. Another approach to classification with incomplete data is to build a classifier that can directly work with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach to classification with incomplete data which also avoids estimating missing values is to build a set of classifiers which then is used to select applicable classifiers for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction and constructing classifiers. The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data. The thesis develops a feature construction method to improve input space for classification algorithms with incomplete data by proposing interval genetic programming-genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate, and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.

Download Full-text

Story Point-Based Effort Estimation Model with Machine Learning Techniques

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194020500035 ◽

2020 ◽

Vol 30 (01) ◽

pp. 43-66

Author(s):

Muaz Gultekin ◽

Oya Kalipsiz

Keyword(s):

Machine Learning ◽

Empirical Evaluation ◽

Machine Learning Algorithms ◽

Decision Makers ◽

Machine Learning Techniques ◽

Effort Estimation ◽

Estimation Model ◽

Software Projects ◽

Learning Techniques ◽

Better Than

Until now, numerous effort estimation models for software projects have been developed, most of them producing accurate results but not providing the flexibility to decision makers during the software development process. The main objective of this study is to objectively and accurately estimate the effort when using the Scrum methodology. A dynamic effort estimation model is developed by using regression-based machine learning algorithms. Story point as a unit of measure is used for estimating the effort involved in an issue. Projects are divided into phases and the phases are respectively divided into iterations and issues. Effort estimation is performed for each issue, then the total effort is calculated with aggregate functions respectively for iteration, phase and project. This architecture of our model provides flexibility to decision makers in any case of deviation from the project plan. An empirical evaluation demonstrates that the error rate of our story point-based estimation model is better than others.

Download Full-text

Algebraic Shortcuts for Leave-One-Out Cross-Validation in Supervised Network Inference

10.1101/242321 ◽

2018 ◽

Author(s):

Michiel Stock ◽

Tapio Pahikkala ◽

Antti Airola ◽

Willem Waegeman ◽

Bernard De Baets

Keyword(s):

Machine Learning ◽

Biological Networks ◽

Regulatory Networks ◽

Network Inference ◽

Cross Validation ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Ligand Interaction ◽

Learning Techniques ◽

Leave One Out

AbstractMotivationSupervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using the model to either predict new interactions in a given network or to predict interactions for a new vertex not present in the original network. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings.ResultsWe present a series of leave-one-out cross-validation shortcuts to rapidly estimate the performance of state-of-the-art kernel-based network inference techniques.AvailabilityThe machine learning techniques with the algebraic shortcuts are implemented in the RLScore software package.

Download Full-text

I TRIED A BUNCH OF THINGS: THE DANGERS OF UNEXPECTED OVERFITTING IN CLASSIFICATION

10.1101/078816 ◽

2016 ◽

Cited By ~ 13

Author(s):

Michael Powell ◽

Mahan Hosseini ◽

John Collins ◽

Chloe Callahan-Flintoft ◽

William Jones ◽

...

Keyword(s):

Machine Learning ◽

Cross Validation ◽

Machine Learning Techniques ◽

Random Data ◽

Effective Protection ◽

Data Set ◽

Learning Techniques ◽

Original Analysis ◽

Spurious Result ◽

Spurious Results

ABSTRACTMachine learning is a powerful set of techniques that has enhanced the abilities of neuroscientists to interpret information collected through EEG, fMRI, and MEG data. With these powerful techniques comes the danger of overfitting of hyper-parameters which can render results invalid, and cause a failure to generalize beyond the data set. We refer to this problem as ‘over-hyping’ and show that it is pernicious despite commonly used precautions. In particular, over-hyping occurs when an analysis is run repeatedly with slightly different analysis parameters and one set of results is selected based on the analysis. When this is done, the resulting method is unlikely to generalize to a new dataset, rendering it a partially, or perhaps even completely spurious result that will not be valid outside of the data used in the original analysis. While it is commonly assumed that cross-validation is an effective protection against such spurious results generated through overfitting or overhyping, this is not actually true. In this article, we show that both one-shot and iterative optimization of an analysis are prone to over-hyping, despite the use of cross-validation. We demonstrate that non-generalizable results can be obtained even on non-informative (i.e. random) data by modifying hyper-parameters in seemingly innocuous ways. We recommend a number of techniques for limiting over-hyping, such as lock-boxes, blind analyses, pre-registrations, and nested cross-validation. These techniques, are common in other fields that use machine learning, including computer science and physics. Adopting similar safeguards is critical for ensuring the robustness of machine-learning techniques in the neurosciences.

Download Full-text

Classification of Fetal State through the application of Machine Learning techniques on Cardiotocography records: Towards Real World Application.

10.1101/2021.06.03.21255808 ◽

2021 ◽

Author(s):

Andrew M V Dadario ◽

Christian Espinoza ◽

Wellington Araujo Nogueira

Keyword(s):

Machine Learning ◽

Real World ◽

Cross Validation ◽

Low Cost ◽

Gaussian Process Regression ◽

Machine Learning Techniques ◽

Real World Application ◽

Learning Techniques ◽

Qualified Personnel

Objective Anticipating fetal risk is a major factor in reducing child and maternal mortality and suffering. In this context cardiotocography (CTG) is a low cost, well established procedure that has been around for decades, despite lacking consensus regarding its impact on outcomes. Machine learning emerged as an option for automatic classification of CTG records, as previous studies showed expert level results, but often came at the price of reduced generalization potential. With that in mind, the present study sought to improve statistical rigor of evaluation towards real world application. Materials and Methods In this study, a dataset of 2126 CTG recordings labeled as normal, suspect or pathological by the consensus of three expert obstetricians was used to create a baseline random forest model. This was followed by creating a lightgbm model tuned using gaussian process regression and post processed using cross validation ensembling. Performance was assessed using the area under the precision-recall curve (AUPRC) metric over 100 experiment executions, each using a testing set comprised of 30% of data stratified by the class label. Results The best model was a cross validation ensemble of lightgbm models that yielded 95.82% AUPRC. Conclusions The model is shown to produce consistent expert level performance at a less than negligible cost. At an estimated 0.78 USD per million predictions the model can generate value in settings with CTG qualified personnel and all the more in their absence.

Download Full-text

Evolutionary Machine Learning for Classification with Incomplete Data

10.26686/wgtn.17072123.v1 ◽

2021 ◽

Author(s):

◽

Cao Truong Tran

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Genetic Programming ◽

Incomplete Data ◽

Missing Values ◽

Machine Learning Techniques ◽

Feature Construction ◽

Classification Algorithms ◽

Learning Techniques ◽

Effectiveness And Efficiency

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be carefully handled because inadequate treatment of missing values will cause large classification errors. Existing most researchers working on classification with incomplete data focused on improving the effectiveness, but did not adequately address the issue of the efficiency of applying the classifiers to classify unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data which can be then used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially for the application process of classification. Another approach to classification with incomplete data is to build a classifier that can directly work with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach to classification with incomplete data which also avoids estimating missing values is to build a set of classifiers which then is used to select applicable classifiers for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction and constructing classifiers. The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data. The thesis develops a feature construction method to improve input space for classification algorithms with incomplete data by proposing interval genetic programming-genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate, and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.

Download Full-text