A Contemporary Machine Learning Method for Accurate Prediction of Cervical Cancer

2021 ◽  
Vol 102 ◽  
pp. 04004
Author(s):  
Jesse Jeremiah Tanimu ◽  
Mohamed Hamada ◽  
Mohammed Hassan ◽  
Saratu Yusuf Ilu

With the advent of new technologies in the medical field, huge amounts of cancer data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that can analyze datasets to extract meaningful patterns, ideas, and hidden knowledge. The mined information can be used to support decision making in diagnostic processes. These techniques can effectively predict the future outcomes of certain diseases and can discover and identify patterns and relationships among them in complex datasets. In this research, a predictive model of patients’ cervical cancer outcomes has been developed, given risk patterns from individual medical records and preliminary screening tests. This work presents a Decision Tree (DT) classification algorithm and shows the advantage of feature selection, using recursive feature elimination for dimensionality reduction, in improving the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced; therefore, a combined under- and oversampling technique, SMOTETomek, was employed. A comparative analysis of the proposed model has been performed to show the effectiveness of feature selection and class-imbalance handling in terms of the classifier’s accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek achieves the best results, with an accuracy of 98%, a sensitivity of 100%, and a specificity of 97%. The Decision Tree classifier is shown to perform excellently on the classification task when the features are reduced and the class-imbalance problem is addressed.
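As a rough illustration of this kind of pipeline, the sketch below combines recursive feature elimination, SMOTETomek resampling, and a decision tree in Python, assuming scikit-learn and imbalanced-learn are available; the file name and target column ("Biopsy") are placeholders, not the authors' exact setup.

```python
# Hedged sketch: decision tree with recursive feature elimination (RFE) and
# SMOTETomek resampling. The CSV file and "Biopsy" target are hypothetical.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from imblearn.combine import SMOTETomek

df = pd.read_csv("cervical_cancer.csv")           # hypothetical file name
df = df.fillna(df.median(numeric_only=True))      # simple handling of missing values
X, y = df.drop(columns=["Biopsy"]), df["Biopsy"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# RFE keeps a reduced subset of predictors for dimensionality reduction.
selector = RFE(DecisionTreeClassifier(random_state=42), n_features_to_select=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# SMOTETomek combines SMOTE oversampling with Tomek-link undersampling.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train_sel, y_train)

clf = DecisionTreeClassifier(random_state=42).fit(X_res, y_res)
y_pred = clf.predict(X_test_sel)
print("accuracy:", accuracy_score(y_test, y_pred))
print("sensitivity (recall):", recall_score(y_test, y_pred))
```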

2021 ◽  
Author(s):  
◽  
Cao Truong Tran

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be handled carefully, because inadequate treatment of missing values will cause large classification errors. Most existing research on classification with incomplete data has focused on improving effectiveness, but has not adequately addressed the efficiency of applying classifiers to classify unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data which can then be used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially in the application phase of classification. Another approach is to build a classifier that can work directly with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach, which also avoids estimating missing values, is to build a set of classifiers which is then used to select applicable classifiers for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values.

The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction, and constructing classifiers.

The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data.

The thesis develops wrapper-based feature selection methods to improve the input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data.

The thesis develops a feature construction method to improve the input space for classification algorithms with incomplete data by proposing interval genetic programming: genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers.

The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate and faster than previous common methods for classification with incomplete data.

The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data.

In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.
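For context, here is a minimal sketch of the common "impute, then classify" baseline discussed above, assuming scikit-learn: missing values are injected into a toy dataset, filled with a KNN imputer, and a decision tree is trained on the completed data. The thesis's evolutionary methods are not reproduced here.

```python
# Hedged sketch of the "impute, then classify" baseline on toy data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data with artificially injected missing values (about 10%).
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Imputation: estimate missing values before any classifier sees the data.
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)   # imputation also runs at prediction time,
                                         # which is the efficiency cost noted above

clf = DecisionTreeClassifier(random_state=0).fit(X_train_imp, y_train)
print("test accuracy:", clf.score(X_test_imp, y_test))
```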


2021 ◽  
pp. 1-11
Author(s):  
Jesús Miguel García-Gorrostieta ◽  
Aurelio López-López ◽  
Samuel González-López ◽  
Adrián Pastor López-Monroy

Academic thesis writing is a complex task that requires the author to be skilled in argumentation. The goal of the academic author is to communicate clear ideas and to convince the reader of the presented claims. However, few students are good arguers, and this is a skill that takes time to master. In this paper, we present an exploration of lexical features used to model the automatic detection of argumentative paragraphs using machine learning techniques. We present a novel proposal, which combines the information in the complete paragraph with the detection of argumentative segments in order to achieve improved results for the detection of argumentative paragraphs. We propose two approaches: a more descriptive one, which uses a decision tree classifier with indicators and lexical features, and a more efficient one, which uses an SVM classifier with lexical features and a Document Occurrence Representation (DOR). Both approaches consider the detection of argumentative segments to ensure that a paragraph detected as argumentative indeed contains argumentative segments. We achieved encouraging results for both approaches.
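A minimal sketch of the SVM route, assuming scikit-learn; a plain bag-of-words representation stands in for the paper's lexical features and DOR, and the example paragraphs and labels are invented placeholders.

```python
# Hedged sketch: SVM classification of argumentative vs. non-argumentative
# paragraphs. Bag-of-words stands in for the paper's lexical features / DOR.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

paragraphs = [
    "Therefore, the results demonstrate that the proposed method is superior because ...",
    "Chapter 2 describes the materials used in the experiments.",
]
labels = [1, 0]  # 1 = argumentative, 0 = non-argumentative (placeholder labels)

model = make_pipeline(CountVectorizer(lowercase=True), LinearSVC())
model.fit(paragraphs, labels)
print(model.predict(["We conclude that the evidence supports the claim since ..."]))
```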


2016 ◽  
Vol 2016 ◽  
pp. 1-21 ◽  
Author(s):  
Taimur Bakhshi ◽  
Bogdan Ghita

Traffic classification utilizing flow measurement enables operators to perform essential network management. Flow accounting methods such as NetFlow are, however, considered inadequate for classification on their own, requiring additional packet-level information, host behaviour analysis, and specialized hardware, which limits their practical adoption. This paper aims to overcome these challenges by proposing a two-phased machine learning classification mechanism with NetFlow as input. The individual flow classes are derived per application through k-means and are further used to train a C5.0 decision tree classifier. As part of validation, the initial unsupervised phase used flow records of fifteen popular Internet applications that were collected and independently subjected to k-means clustering to determine the unique flow classes generated per application. The derived flow classes were afterwards used to train and test a supervised C5.0-based decision tree. The resulting classifier reported an average accuracy of 92.37% on approximately 3.4 million test cases, increasing to 96.67% with adaptive boosting. The classifier specificity factor, which accounted for differentiating content-specific from supplementary flows, ranged between 98.37% and 99.57%. Furthermore, the computational performance and accuracy of the proposed methodology in comparison with similar machine learning techniques lead us to recommend its extension to other applications in achieving highly granular real-time traffic classification.
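A toy sketch of the two-phased idea, assuming scikit-learn: k-means derives flow classes per application, and a decision tree (standing in for C5.0, which scikit-learn does not provide) is then trained on those classes. The synthetic feature vectors are placeholders for NetFlow fields.

```python
# Hedged sketch of the two-phase mechanism on synthetic "flow" data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
apps = {"web": rng.normal(0, 1, (300, 4)), "video": rng.normal(3, 1, (300, 4))}

X_parts, y_parts = [], []
for name, flows in apps.items():
    # Phase 1: unsupervised clustering derives flow classes per application.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(flows)
    X_parts.append(flows)
    y_parts.extend(f"{name}_{c}" for c in clusters)

X, y = np.vstack(X_parts), np.array(y_parts)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Phase 2: a supervised decision tree maps flow features to the derived classes.
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
print("accuracy:", tree.score(X_te, y_te))
```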


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2649 ◽  
Author(s):  
Amin Ul Haq ◽  
Jian Ping Li ◽  
Jalaluddin Khan ◽  
Muhammad Hammad Memon ◽  
Shah Nazir ◽  
...  

Significant attention has been paid to the accurate detection of diabetes. It is a big challenge for the research community to develop a diagnosis system that detects diabetes successfully in an e-healthcare environment. Machine learning techniques have an emerging role in healthcare services, delivering systems that analyze medical data for the diagnosis of diseases. The existing diagnosis systems have some drawbacks, such as high computation time and low prediction accuracy. To handle these issues, we have proposed a diagnosis system using machine learning methods for the detection of diabetes. The proposed method has been tested on a diabetes dataset, a clinical dataset derived from patients’ clinical histories. Further, model validation methods such as hold-out, K-fold, and leave-one-subject-out, together with performance evaluation metrics including accuracy, specificity, sensitivity, F1-score, receiver operating characteristic curve, and execution time, have been used to check the validity of the proposed system. We have proposed a filter method based on the Decision Tree (Iterative Dichotomiser 3) algorithm for selecting the most important features. Two ensemble learning algorithms, AdaBoost and Random Forest, are also used for feature selection, and we compared the classifier performance with wrapper-based feature selection algorithms. A Decision Tree classifier has been used for the classification of healthy and diabetic subjects. The experimental results show that the features selected by the proposed feature selection algorithm improve the classification performance of the predictive model and achieve optimal accuracy. Additionally, the performance of the proposed system is high compared to previous state-of-the-art methods. The high performance of the proposed method is due to the different combinations of selected feature sets; Plasma glucose concentration, Diabetes pedigree function, and Body mass index are the most significant features in the dataset for the prediction of diabetes. Furthermore, statistical analysis of the experimental results demonstrated that the proposed method would effectively detect diabetes and can be deployed in an e-healthcare environment.
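A rough sketch of the filter-style idea, assuming scikit-learn and a hypothetical diabetes.csv with an "Outcome" column: features are ranked by tree-based importance (a Random Forest is used here as one possible ranker) and a decision tree is evaluated on the reduced set with 10-fold cross-validation. This is not the authors' exact ID3-based filter.

```python
# Hedged sketch: tree-importance-based feature filtering plus K-fold evaluation.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("diabetes.csv")                 # hypothetical file with an "Outcome" column
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# Rank features by importance and keep the top few as the "filtered" subset.
ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(ranker.feature_importances_, index=X.columns)
top_features = importances.nlargest(4).index.tolist()
print("selected features:", top_features)

# Evaluate a decision tree on the reduced feature set with 10-fold CV.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X[top_features], y, cv=10)
print("mean 10-fold accuracy: %.3f" % scores.mean())
```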


Electronics ◽  
2021 ◽  
Vol 10 (6) ◽  
pp. 699
Author(s):  
Yogendra Singh Solanki ◽  
Prasun Chakrabarti ◽  
Michal Jasinski ◽  
Zbigniew Leonowicz ◽  
Vadim Bolshev ◽  
...  

Nowadays, breast cancer is the most frequent cancer among women. Early detection is a critical issue that can be effectively achieved by machine learning (ML) techniques. Thus, in this article, methods to improve the accuracy of ML classification models for the prognosis of breast cancer are investigated. A wrapper-based feature selection approach, along with nature-inspired algorithms such as Particle Swarm Optimization, Genetic Search, and Greedy Stepwise, has been used to identify the important features. On these selected features, the popular machine learning classifiers Support Vector Machine, J48 (the C4.5 Decision Tree algorithm), and Multilayer Perceptron (a feed-forward ANN) were used in the system. The methodology of the proposed system is structured into five stages: (1) data pre-processing; (2) data imbalance handling; (3) feature selection; (4) machine learning classifiers; and (5) classifier performance evaluation. The dataset used in this research is the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. This article indicates that the J48 decision tree classifier is the appropriate machine learning-based classifier for optimum breast cancer prognosis. Support Vector Machine with the Particle Swarm Optimization algorithm for feature selection achieves an accuracy of 98.24%, MCC = 0.961, Sensitivity = 99.11%, Specificity = 96.54%, and Kappa statistics of 0.9606. It is also observed that the J48 Decision Tree classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.83%, MCC = 0.974, Sensitivity = 98.95%, Specificity = 98.58%, and Kappa statistics of 0.9735. Furthermore, the Multilayer Perceptron ANN classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.59%, MCC = 0.968, Sensitivity = 98.6%, Specificity = 98.57%, and Kappa statistics of 0.9682.
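A minimal sketch of the wrapper principle, assuming scikit-learn: the fitness of a candidate feature subset is its cross-validated accuracy with the chosen classifier. A tiny random search stands in for the PSO, Genetic Search, and Greedy Stepwise searches named above.

```python
# Hedged sketch of wrapper-based feature selection: CV accuracy as the
# subset fitness, with a random search as a placeholder for PSO/GA/greedy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(7)
n_features = X.shape[1]

def fitness(mask):
    """Wrapper objective: CV accuracy of the classifier on the masked features."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(), X[:, mask], y, cv=5).mean()

best_mask, best_score = None, -1.0
for _ in range(20):                          # tiny random search as a placeholder
    mask = rng.random(n_features) < 0.5      # candidate subset of features
    score = fitness(mask)
    if score > best_score:
        best_mask, best_score = mask, score

print("best subset size:", int(best_mask.sum()), "CV accuracy: %.3f" % best_score)
```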


Author(s):  
Cristián Castillo-Olea ◽  
Begonya Garcia-Zapirain Soto ◽  
Clemente Zuñiga

The article presents a study based on timeline data analysis of the level of sarcopenia in older patients in Baja California, Mexico. Information was examined at the beginning of the study (first event), three months later (second event), and six months later (third event). Sarcopenia is defined as the loss of muscle mass, quality, and strength. The study was conducted with 166 patients; 65% were women and 35% were men. The mean age of the enrolled patients was 77.24 years. The research included 99 variables covering medical history, pharmacology, psychological tests, comorbidity (Charlson), functional capacity (Barthel and Lawton), undernourishment (the validated Mini Nutritional Assessment (MNA) test), as well as biochemical and socio-demographic data. Our aim was to evaluate the prevalence of the level of sarcopenia in a population of chronically ill patients assessed at the Tijuana General Hospital. We used machine learning techniques to identify the determining variables and to track the patients’ evolution. The following classifiers were used: Support Vector Machines, Linear Support Vector Machines, Radial Basis Function, Gaussian Process, Decision Tree, Random Forest, Multilayer Perceptron, AdaBoost, Gaussian Naive Bayes, and Quadratic Discriminant Analysis. In order of importance, we found that the following variables determine the level of sarcopenia: Age, Systolic arterial hypertension, Mini Nutritional Assessment (MNA), Number of chronic diseases, and Sodium. They are therefore considered relevant in the decision-making process of choosing treatment or prevention. Analysis of the relationship between the presence of the variables and the classifiers used to measure sarcopenia revealed that the Decision Tree classifier, with the Age, Systolic arterial hypertension, MNA, Number of chronic diseases, and Sodium variables, showed a precision of 0.864, an accuracy of 0.831, and an F1 score of 0.900 in the first and second events. A precision of 0.867, an accuracy of 0.825, and an F1 score of 0.867 were obtained in the third event with the same variables. We can therefore conclude that the Decision Tree classifier yields the best results for the assessment of the determining variables and suggests that the study population’s sarcopenia did not change from moderate to severe.
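A small sketch of such a classifier comparison, assuming scikit-learn; synthetic data of the same sample size stands in for the 99 clinical variables, and only a few of the named models are shown.

```python
# Hedged sketch: comparing several classifiers on placeholder data with
# precision, accuracy, and F1, as in the study's evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=166, n_features=5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=3),
    "Random Forest": RandomForestClassifier(random_state=3),
    "AdaBoost": AdaBoostClassifier(random_state=3),
    "Gaussian NB": GaussianNB(),
}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name,
          "precision=%.3f" % precision_score(y_te, y_pred),
          "accuracy=%.3f" % accuracy_score(y_te, y_pred),
          "F1=%.3f" % f1_score(y_te, y_pred))
```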


2020 ◽  
Vol 17 (9) ◽  
pp. 4055-4060
Author(s):  
L. Girish ◽  
Sridhar K. N. Rao

Virtualized data centers bring many benefits by reducing the heavy usage of physical hardware, and the usage of cloud infrastructures is nowadays rapidly increasing in all fields to provide services on demand. In a cloud data center, achieving efficient resource sharing between virtual machines and physical machines is very important. To achieve efficient resource sharing, the performance degradation of virtual machines must be modeled and their sensitivity quantified and predicted correctly. In this work, we use machine learning techniques such as decision tree, K-nearest neighbor, and logistic regression to calculate the sensitivity of virtual machines. The dataset used for the experiment was collected from an OpenStack cloud environment. We execute two scenarios in this experiment to evaluate the performance of the three classifiers based on precision, recall, sensitivity, and specificity. We achieved good results using the decision tree classifier, with a precision of 88.8%, a recall of 80%, and an accuracy of 97.30%.
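A short sketch, assuming scikit-learn, of how precision, recall/sensitivity, and specificity can be computed from a confusion matrix for one of the classifiers; the synthetic data is a placeholder for the OpenStack measurements.

```python
# Hedged sketch: decision tree evaluation with confusion-matrix-derived metrics.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score

X, y = make_classification(n_samples=1000, weights=[0.8], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

y_pred = DecisionTreeClassifier(random_state=5).fit(X_tr, y_tr).predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()

print("precision:", precision_score(y_te, y_pred))
print("sensitivity (recall):", recall_score(y_te, y_pred))   # tp / (tp + fn)
print("specificity:", tn / (tn + fp))                        # true-negative rate
```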


2021 ◽  
Vol 9 (2) ◽  
pp. 1-14
Author(s):  
Anthony Onoja ◽  
Mary Oyinlade Ejiwale ◽  
Ayesan Rewane

This study aimed to ascertain, using statistical feature selection methods and interpretable machine learning models, the best features that predict risk status (“Low”, “Medium”, “High”) for COVID-19 infection. The study utilizes a publicly available dataset obtained via an online web-based risk assessment calculator to ascertain the risk status of COVID-19 infection. 57 out of 59 features were first filtered for multicollinearity using the Pearson correlation coefficient and further shrunk to 55 features by the LASSO GLM approach. The SMOTE resampling technique was used to address the problem of imbalanced class distribution. Interpretable ML algorithms were employed during the classification phase. The best classifier’s predictions were saved as a new instance and approximated using a single Decision Tree classifier. To further build trust in and explainability of the best model, an XGBoost classifier was used as a global surrogate model trained on the predictions of the best model. Individual explanations of the XGBoost model were produced using the SHAP explainable-AI framework. A Random Forest classifier, with a validation accuracy of 96.35% on the 55 features retained by feature selection, emerged as the best classifier model. The Decision Tree classifier approximated the best classifier correctly with a prediction accuracy of 92.23% and a Matthews correlation coefficient of 0.8960. The XGBoost classifier approximated the best classifier model with a prediction score of 99.7%. This study identified COVID-19 positive status, COVID-19 contacts, COVID-19 symptoms, health worker status, and public transport count as the five most consistent features that predict an individual’s risk of exposure to COVID-19.
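A compact sketch of the global-surrogate idea, assuming scikit-learn: a single decision tree is trained on the predictions of a Random Forest, so that the tree approximates, and helps explain, the stronger model. The data, sizes, and tree depth are placeholders, and the SHAP step is omitted.

```python
# Hedged sketch of a global surrogate: a shallow decision tree fit to the
# predictions of a "black-box" Random Forest on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef

X, y = make_classification(n_samples=2000, n_features=20, random_state=11)

black_box = RandomForestClassifier(n_estimators=300, random_state=11).fit(X, y)
y_bb = black_box.predict(X)                      # black-box predictions become labels

surrogate = DecisionTreeClassifier(max_depth=5, random_state=11).fit(X, y_bb)
y_sur = surrogate.predict(X)

# Fidelity: how closely the surrogate reproduces the black-box predictions.
print("fidelity (accuracy):", accuracy_score(y_bb, y_sur))
print("fidelity (MCC):", matthews_corrcoef(y_bb, y_sur))
```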


2021 ◽  
Vol 18 (2(Suppl.)) ◽  
pp. 0884
Author(s):  
Raja Azlina Raja Mahmood ◽  
AmirHossien Abdi ◽  
Masnida Hussin

Some of the main challenges in developing an effective network-based intrusion detection system (IDS) include analyzing large network traffic volumes and realizing the decision boundaries between normal and abnormal behaviors. Deploying feature selection together with efficient classifiers in the detection system can overcome these problems. Feature selection finds the most relevant features, thus reducing the dimensionality and the complexity of analyzing the network traffic. Moreover, using the most relevant features to build the predictive model reduces the complexity of the developed model, thus reducing the classifier model building time and consequently improving the detection performance. In this study, two different sets of selected features have been adopted to train four machine-learning based classifiers. The two sets of selected features are based on the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) approaches, respectively. These evolutionary-based algorithms are known to be effective in solving optimization problems. The classifiers used in this study are Naïve Bayes, k-Nearest Neighbor, Decision Tree, and Support Vector Machine, which have been trained and tested using the NSL-KDD dataset. The performance of the abovementioned classifiers using the different feature sets was evaluated. The experimental results indicate that the detection accuracy improves by approximately 1.55% when using the PSO-based selected features rather than the GA-based selected features. The Decision Tree classifier trained with the PSO-based selected features outperformed the other classifiers, with accuracy, precision, recall, and F-score of 99.38%, 99.36%, 99.32%, and 99.34%, respectively. The results show that coupling optimal features with a good classifier in a detection system can reduce the classifier model building time, reduce the computational burden of analyzing data, and consequently attain a high detection rate.
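A toy sketch of the build-time claim, assuming scikit-learn: the same decision tree is trained on the full feature set and on a reduced subset, and fit time and accuracy are compared. Synthetic data stands in for NSL-KDD, and the variance-based "selection" is a placeholder, not GA or PSO.

```python
# Hedged sketch: comparing fit time and accuracy with and without a reduced
# feature set. The feature "selection" here is a deliberate placeholder.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=40, n_informative=10, random_state=2)
selected = np.argsort(np.var(X, axis=0))[-10:]   # placeholder "selected" feature indices

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

for name, cols in [("all 40 features", slice(None)), ("10 selected features", selected)]:
    clf = DecisionTreeClassifier(random_state=2)
    start = time.perf_counter()
    clf.fit(X_tr[:, cols], y_tr)
    elapsed = time.perf_counter() - start
    print(f"{name}: fit time {elapsed:.3f}s, accuracy {clf.score(X_te[:, cols], y_te):.3f}")
```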

