An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples

PeerJ Computer Science ◽

10.7717/peerj-cs.671 ◽

2021 ◽

Vol 7 ◽

pp. e671

Author(s):

Shilpi Bose ◽

Chandra Das ◽

Abhik Banerjee ◽

Kuntal Ghosh ◽

Matangini Chattopadhyay ◽

...

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Class Imbalance ◽

Classification Model ◽

Machine Learning Techniques ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Attribute Clustering

Background: Machine learning is a kind of machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Because of this capability, machine learning techniques are widely used in medical applications, especially those involving large-scale genomic and proteomic data. Cancer classification based on bio-molecular profiling data is an important topic for medical applications because it improves the diagnostic accuracy of cancer and supports successful treatment. Hence, machine learning techniques are widely used in cancer detection and prognosis.

Methods: In this article, a new ensemble machine learning classification model, the Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC), is proposed to handle the class imbalance problem and the high dimensionality of microarray datasets. The model first generates a number of bootstrapped datasets from the original training data, applying an oversampling procedure to handle the class imbalance problem. The proposed MFSAC method is then applied to each bootstrapped dataset to generate sub-datasets, each containing a subset of the most relevant (informative) attributes of the original dataset. MFSAC is a feature selection technique that combines multiple filters with a new supervised attribute clustering algorithm. A base classifier is then constructed separately for every sub-dataset, and the predictions of these base classifiers are combined by majority voting to form the MFSAC-based ensemble classifier. In addition, the most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets.

Results: To assess its performance, the proposed MFSAC-EC model is applied to different high-dimensional microarray gene expression datasets for cancer sample classification and compared with well-known existing models. The experimental results show that the generalization performance (testing accuracy) of the proposed classifier is significantly better than that of these existing models. The proposed model can also identify many important attributes (biomarker genes).
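The pipeline described in the Methods can be sketched roughly as follows. This is only an illustration of the overall flow (balanced bootstrap, per-bootstrap feature selection, one base classifier per sub-dataset, majority voting, feature frequency counting), not the authors' MFSAC algorithm. The class-mean-gap filter, the nearest-centroid base classifier, the toy data, and all function names are assumptions introduced here.

```python
import random
from collections import Counter


def balanced_bootstrap(X, y, rng):
    """Resample with replacement so every class reaches the majority-class size."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_max = max(len(rows) for rows in by_class.values())
    idx = [rng.choice(rows) for rows in by_class.values() for _ in range(n_max)]
    return [X[i] for i in idx], [y[i] for i in idx]


def top_features(X, y, k):
    """Stand-in filter (NOT MFSAC): rank features by the gap between class means."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        means = []
        for c in labels:
            vals = [row[j] for row, lab in zip(X, y) if lab == c]
            means.append(sum(vals) / len(vals))
        scores.append((max(means) - min(means), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def fit_centroids(X, y, feats):
    """Stand-in base classifier: one centroid per class on the selected features."""
    sums, counts = {}, Counter(y)
    for row, lab in zip(X, y):
        acc = sums.setdefault(lab, [0.0] * len(feats))
        for t, j in enumerate(feats):
            acc[t] += row[j]
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_one(centroids, feats, row):
    def dist(c):
        return sum((row[j] - c[t]) ** 2 for t, j in enumerate(feats))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def fit_ensemble(X, y, n_models=15, k=2, seed=0):
    rng = random.Random(seed)
    models, usage = [], Counter()
    for _ in range(n_models):
        Xb, yb = balanced_bootstrap(X, y, rng)
        feats = top_features(Xb, yb, k)
        usage.update(feats)  # how often each attribute is selected
        models.append((feats, fit_centroids(Xb, yb, feats)))
    return models, usage


def predict(models, row):
    """Majority vote over the base classifiers' predictions."""
    votes = Counter(predict_one(c, f, row) for f, c in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 12 "normal" and 4 "tumor" samples, 5 attributes,
# of which only attributes 0 and 3 separate the classes.
data_rng = random.Random(42)

def sample(label):
    shift = 3.0 if label == "tumor" else 0.0
    row = [data_rng.uniform(-1, 1) for _ in range(5)]
    row[0] = shift + data_rng.uniform(0, 0.5)
    row[3] = shift + data_rng.uniform(0, 0.5)
    return row

y = ["normal"] * 12 + ["tumor"] * 4
X = [sample(label) for label in y]
models, usage = fit_ensemble(X, y)
print(predict(models, [3.2, 0.0, 0.0, 3.1, 0.0]))  # tumor
print(usage.most_common(2))
```

Counting how often an attribute appears across the sub-datasets (`usage` here) mirrors the abstract's idea of ranking attributes by frequency of occurrence.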

Bug Severity Prediction using Class Imbalance Problem

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7297.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 2687-2695

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Machine Learning Techniques ◽

System Level ◽

Class Imbalance Problem ◽

Component Level ◽

Software Bugs ◽

Imbalance Problem ◽

Learning Techniques

The class imbalance problem arises when instances of the majority class outnumber instances of the minority class. Imbalanced data severely degrade the performance of machine learning techniques in several fields. A model trained on a skewed distribution either predicts the majority class with a high error rate or fails to predict the minority class at all. To address imbalanced software bug data, the synthetic minority oversampling technique (SMOTE) is used to balance datasets from Apache projects. It is applied to bug summaries to balance the dataset before predicting severity at the system and component levels. Several machine learning techniques are applied to both the imbalanced and the balanced datasets to predict the severity of software bugs from their textual descriptions. Test outcomes and statistical analysis show better Gmean and balance scores on the balanced datasets than on the imbalanced data. Gmean improves by 34% and balance by 11% at the system level, and by 42% and 62%, respectively, at the component level. Further, solving the class imbalance problem on textual data was observed to improve performance.
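As a rough illustration of the two ideas named in this abstract, the sketch below shows the core SMOTE step (interpolating between a minority sample and one of its nearest minority neighbours) and the Gmean metric (geometric mean of sensitivity and specificity). It assumes the bug summaries have already been turned into numeric feature vectors. The function names, the toy vectors, and the choice of `k` are assumptions for illustration, not the paper's setup.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority points, sorted by squared Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (both classes must be present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return math.sqrt(tp / (tp + fn) * tn / (tn + fp))


# Hypothetical minority-class vectors (e.g. features of rare high-severity bugs).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote(minority, n_new=6)

# A model that always predicts the majority class is 60% accurate here,
# yet its Gmean is zero because it never finds a minority instance.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
print(g_mean(y_true, always_majority))  # 0.0
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.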

A Leaf Disease Classification Model in Betel Vine Using Machine Learning Techniques

2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST) ◽

10.1109/icrest51555.2021.9331142 ◽

2021 ◽

Author(s):

Md Zahid Hasan ◽

Nahid Zeba ◽

Md. Abdul Malek ◽

Sanjida Sultana Reya

Keyword(s):

Machine Learning ◽

Disease Classification ◽

Classification Model ◽

Machine Learning Techniques ◽

Leaf Disease ◽

Learning Techniques

Prediction of Clinical Risk Factors of Diabetes Using Multiple Machine Learning Techniques Resolving Class Imbalance

2020 23rd International Conference on Computer and Information Technology (ICCIT) ◽

10.1109/iccit51783.2020.9392694 ◽

2020 ◽

Author(s):

Kazi Amit Hasan ◽

Md. Al Mehedi Hasan

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Class Imbalance ◽

Clinical Risk Factors ◽

Machine Learning Techniques ◽

Clinical Risk ◽

Learning Techniques

Identification with machine learning techniques of a classification model for the degree of damage to rubber-textile conveyor belts with the aim to achieve sustainability

Engineering Failure Analysis ◽

10.1016/j.engfailanal.2021.105564 ◽

2021 ◽

pp. 105564

Author(s):

Andrejiova Miriam ◽

Anna Grincova ◽

Daniela Marasova

Keyword(s):

Machine Learning ◽

Classification Model ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Conveyor Belts ◽

Degree Of Damage

A Macrocause Classification Model for Violent Crime Analysis in the Field of Public Safety Based on Machine Learning Techniques

10.1109/isc253183.2021.9562842 ◽

2021 ◽

Author(s):

Ramiro de Vasconcelos dos Santos Junior ◽

Joao Vitor Venceslau Coelho ◽

Nelio Alessandro Azevedo Cacho

Keyword(s):

Machine Learning ◽

Violent Crime ◽

Public Safety ◽

Classification Model ◽

Machine Learning Techniques ◽

Crime Analysis ◽

Learning Techniques

Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

10.1109/icscc51209.2021.9528170 ◽

2021 ◽

Author(s):

Sushant Kumar Pandey ◽

Anil Kumar Tripathi

Keyword(s):

Machine Learning ◽

Empirical Study ◽

Prediction Models ◽

Class Imbalance ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Defect Prediction Models

Learning from Imbalanced Educational Data Using Ensemble Machine Learning Algorithms

Webology ◽

10.14704/web/v18si01/web18053 ◽

2021 ◽

Vol 18 (Special Issue 01) ◽

pp. 183-195

Author(s):

Thingbaijam Lenin ◽

N. Chandrasekaran

Keyword(s):

Machine Learning ◽

Random Forest ◽

Missing Values ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Adaptive Boosting ◽

Stochastic Gradient Boosting ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Student’s Performance

Students’ academic performance is one of the most important parameters for evaluating the standard of an institute. It has become of paramount importance for an institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model that predicts a student’s performance as early as the time of admission. The task is challenging, however, because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms such as adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), to develop a model for predicting the performance of students at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbor (knn) data imputation technique to handle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa. As the data are imbalanced, we avoid using accuracy as the evaluation metric and instead use balanced accuracy and F-score. We compare the ensemble techniques with a single classifier, C4.5. The best result is provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
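The gap between the reported accuracy and the other two figures is easier to see with a worked example. The confusion matrix below is hypothetical: it was chosen here only because it happens to reproduce the three reported percentages, and the paper's actual counts are not given in this listing. The function name and argument names are likewise assumptions.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy, balanced accuracy and F-score from binary confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity on the minority (positive) class
    specificity = tn / (tn + fp)     # recall on the majority (negative) class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "f_score": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 6 at-risk students out of 98, half of them missed.
metrics = summarize(tp=3, fn=3, fp=0, tn=92)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 96.94%
# balanced_accuracy: 75.00%
# f_score: 66.67%
```

Under these assumed counts, the classifier misses half of the minority class while accuracy stays near 97%, which is the kind of distortion that motivates reporting balanced accuracy and F-score on imbalanced data.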

Identifying student behavior in MOOCs using Machine Learning

International Journal for Innovation Education and Research ◽

10.31686/ijier.vol7.iss3.1318 ◽

2019 ◽

Vol 7 (3) ◽

pp. 30-39 ◽

Cited By ~ 1

Author(s):

Vanessa Faria De Souza ◽

Gabriela Perry

Keyword(s):

Machine Learning ◽

Literature Review ◽

Student Behavior ◽

Class Imbalance ◽

External Factors ◽

Class Imbalance Problem ◽

Data Manipulation ◽

Imbalance Problem ◽

Student Classification

This paper presents the results of a literature review carried out to identify prevalent research goals and challenges in predicting student behavior in MOOCs using machine learning. The results allowed recognizing three goals: 1. Student Classification and 2. Dropout prediction. Regarding the challenges, five items were identified: 1. Incompatibility of AVAs, 2. Complexity of data manipulation, 3. Class Imbalance Problem, 4. Influence of External Factors and 5. Difficulty in manipulating data by untrained personnel.

Improving Logging Prediction on Imbalanced Datasets

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2016040103 ◽

2016 ◽

Vol 7 (2) ◽

pp. 43-71 ◽

Cited By ~ 3

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet difficult decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code constructs. The prediction performance of these models is limited by the class-imbalance problem, since the number of logged code constructs is small compared with the number of non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performance of J48, RF, and SVM classifiers for predicting logged catch-blocks and if-blocks on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, the LogIm model improves the performance of the baseline classifiers J48, RF, and SVM by 7.38%, 9.24%, and 4.6% for catch-block logging prediction, and by 12.11%, 14.95%, and 19.13% for if-block logging prediction.
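The abstract does not describe how LogIm works internally, so the following is only a generic sketch of what "ensemble and threshold-based" can mean on imbalanced data: average the positive-class probabilities of several base classifiers, then pick the decision threshold that maximizes F1 on validation data instead of using the default 0.5. The probability values, the variable names, and the candidate thresholds are all made up for illustration.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def ensemble_predict(prob_lists, threshold):
    """Average the base classifiers' probabilities, then apply the threshold."""
    averaged = [sum(ps) / len(ps) for ps in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


def tune_threshold(prob_lists, y_true, candidates):
    """Pick the candidate threshold with the best F1 on validation data."""
    return max(candidates, key=lambda t: f1(y_true, ensemble_predict(prob_lists, t)))


# Hypothetical validation set: 2 logged constructs (1) among 8 code blocks.
y_val = [1, 1, 0, 0, 0, 0, 0, 0]
# Made-up positive-class probabilities from three base classifiers.
j48_probs = [0.6, 0.3, 0.2, 0.1, 0.4, 0.1, 0.2, 0.1]
rf_probs = [0.7, 0.4, 0.3, 0.2, 0.3, 0.1, 0.1, 0.2]
svm_probs = [0.5, 0.5, 0.1, 0.3, 0.2, 0.2, 0.3, 0.0]
probs = [j48_probs, rf_probs, svm_probs]

best = tune_threshold(probs, y_val, candidates=[0.25, 0.35, 0.5])
print(best)                           # 0.35
print(ensemble_predict(probs, 0.5))   # [1, 0, 0, 0, 0, 0, 0, 0]
print(ensemble_predict(probs, best))  # [1, 1, 0, 0, 0, 0, 0, 0]
```

In this toy case the default threshold of 0.5 misses one of the two logged constructs, while the tuned threshold recovers it without adding false positives.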

Non-Intrusive Load Monitoring of Residential Water-Heating Circuit Using Ensemble Machine Learning Techniques

Inventions ◽

10.3390/inventions5040057 ◽

2020 ◽

Vol 5 (4) ◽

pp. 57

Author(s):

Attique Ur Rehman ◽

Tek Tjing Lie ◽

Brice Vallès ◽

Shafiqur Rahman Tito

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Machine Learning Techniques ◽

Learning Models ◽

Water Heating ◽

Energy Monitoring ◽

Non Invasive ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Load Monitoring

Recent advances in computational capabilities and the deployment of smart meters have revived non-intrusive load monitoring as a promising technique for energy monitoring. Toward effective energy monitoring, this paper presents a non-invasive load inference approach assisted by feature selection and ensemble machine learning techniques. To evaluate and validate the proposed approach, water heating is considered, as it is one of the major residential load elements with solid potential for energy efficiency applications. Moreover, to reflect real-life deployment, digital simulations are carried out on low-sampling-rate real-world load measurements from the New Zealand GREEN Grid Database. MATLAB and Python (Scikit-Learn) are used as simulation tools. The employed learning models, both standalone and ensemble, are trained on a single household’s load data and later tested rigorously on load data from a set of diverse households to validate their generalization capability. The paper presents a comprehensive performance evaluation of the approach in the context of event detection, feature selection, and learning models. Based on the study and the corresponding analysis of the results, it is concluded that the proposed approach generalizes well to unseen testing data and yields promising results in terms of non-invasive load inference.
