Evaluating explorative prediction power of machine learning algorithms for materials discovery using k-fold forward cross-validation

Abstract: With an estimated 268,600 new cases in 2019, breast cancer is one of the most common cancers and one of the world’s leading causes of death for women. Classification and data mining are efficient ways to organize information, particularly in the medical field, where prediction techniques are commonly used for early detection and effective treatment in diagnosis and research. This paper tests models for the mammogram analysis of breast cancer data using 23 of the more widely used machine learning algorithms, such as Decision Tree, Random Forest, K-nearest neighbors and support vector machine. The data are split randomly using a repeated 10-fold cross-validation method. Performance is calculated with regression metrics such as Mean Absolute Error, Mean Squared Error and R2 score, and with clustering metrics such as Adjusted Rand Index, Homogeneity and V-measure; accuracy is also checked with F-Measure, AUC and cross-validation. Proper identification of patients with breast cancer would create care opportunities; for example, supervision and the implementation of intervention plans could benefit the quality of long-term care. Experimental results reveal that the maximum precision of 100%, with the lowest error rate, is obtained with the AdaBoost classifier.
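The repeated 10-fold splitting mentioned in the abstract can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function name `repeated_kfold`, the sample count of 50 and the three repetitions are all assumptions, since the abstract does not state how many repetitions were used.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repetition
        for fold in range(k):
            test = order[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test

# Assumed sizes for illustration: 50 samples, 10 folds, 3 repetitions
splits = list(repeated_kfold(n_samples=50, k=10, repeats=3))
print(len(splits))  # one train/test pair per fold per repetition
```

Within each repetition every sample is held out exactly once, so averaging a score over all pairs reduces the dependence on any single random partition.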

The Impact of Selecting a Validation Method in Machine Learning on Predicting Basketball Game Outcomes

Symmetry, 2020, Vol 12 (3), pp. 431
DOI: 10.3390/sym12030431
Cited By ~ 1

Author(s): Tomislav Horvat, Ladislav Havaš, Dunja Srpak

Keyword(s): Machine Learning, Cross Validation, Learning Algorithms, Machine Learning Algorithms, Test Validation, Sporting Events, Validation Method, Validation Methods, Independent Events, The Impact

Interest in sports predictions as well as the public availability of large amounts of structured and unstructured data are increasing every day. As sporting events are not completely independent events, but characterized by the influence of the human factor, the adequate selection of the analysis process is very important. In this paper, seven different classification machine learning algorithms are used and validated with two validation methods: Train&Test and cross-validation. Validation methods were analyzed and critically reviewed. The obtained results are analyzed and compared. Analyzing the results of the used machine learning algorithms, the best average prediction results were obtained by using the nearest neighbors algorithm and the worst prediction results were obtained by using decision trees. The cross-validation method obtained better results than the Train&Test validation method. The prediction results of the Train&Test validation method by using disjoint datasets and up-to-date data were also compared. Better results were obtained by using up-to-date data. In addition, directions for future research are also explained.
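The difference between the two validation methods can be illustrated with a small sketch. It assumes a 1-nearest-neighbor classifier and synthetic two-cluster data; none of the names, data or split sizes come from the paper, and the numbers it prints say nothing about basketball prediction.

```python
import random

def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point` (1-NN)."""
    closest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)))
    return closest[1]

def accuracy(train, test):
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

def train_test_score(data, test_fraction=0.25):
    """Train&Test: one fixed split, the last `test_fraction` of the data is held out."""
    cut = int(len(data) * (1 - test_fraction))
    return accuracy(data[:cut], data[cut:])

def cross_val_score(data, k=5):
    """k-fold cross-validation: every example is held out exactly once."""
    scores = []
    for fold in range(k):
        test = data[fold::k]
        train = [ex for i, ex in enumerate(data) if i % k != fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k

# Synthetic toy data (an assumption): two noisy 2-D clusters labelled 0 and 3
rng = random.Random(42)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), c) for c in (0, 3) for _ in range(40)]
rng.shuffle(data)
print(train_test_score(data), cross_val_score(data))
```

The Train&Test estimate depends on which examples happen to fall in the single held-out part, while the cross-validation estimate averages over k different held-out parts.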

A cross-validation scheme for machine learning algorithms in shotgun proteomics

BMC Bioinformatics, 2012, Vol 13 (Suppl 16), pp. S3
DOI: 10.1186/1471-2105-13-s16-s3
Cited By ~ 15

Author(s): Viktor Granholm, William Noble, Lukas Käll

Keyword(s): Machine Learning, Cross Validation, Learning Algorithms, Shotgun Proteomics, Machine Learning Algorithms, Validation Scheme

Question terminology and representation for question type classification

Terminology, 2004, Vol 10 (1), pp. 153-168
DOI: 10.1075/term.10.1.08tom
Cited By ~ 4

Author(s): Noriko Tomuro

Keyword(s): Machine Learning, Classification Accuracy, Cross Validation, Learning Algorithms, Machine Learning Algorithms, Question Type, Semantic Features, Feature Sets, Fixed Expressions, Type Classification

Question terminology is a set of terms which appear in keywords, idioms and fixed expressions commonly observed in questions. This paper investigates ways to automatically extract question terminology from a corpus of questions and represent it for the purpose of classifying questions by type. Our key interest is to see whether semantic features can enhance the representation of the strongly lexical nature of question sentences. We compare two feature sets: one with lexical features only, and another with a mixture of lexical and semantic features. For evaluation, we measure the classification accuracy achieved by two machine learning algorithms, C5.0 and PEBLS, using a procedure called domain cross-validation, which effectively measures the domain transferability of features.
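The abstract does not spell out how domain cross-validation is carried out. One plausible reading, assumed here, is that each fold holds out all questions from one domain and trains on the remaining domains. The sketch below shows only that splitting step; the function name, the domains and the example questions are invented for illustration.

```python
from collections import defaultdict

def domain_folds(examples):
    """Yield (held_out_domain, train, test), holding out one domain at a time.

    `examples` is a list of (domain, text, label) tuples.
    """
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex[0]].append(ex)
    for held_out in sorted(by_domain):
        test = by_domain[held_out]
        train = [ex for d, group in by_domain.items() if d != held_out for ex in group]
        yield held_out, train, test

# Hypothetical questions tagged with an invented domain and question type
questions = [
    ("travel", "how far is the airport", "distance"),
    ("travel", "where is the station", "location"),
    ("cooking", "how long should rice boil", "duration"),
    ("health", "what causes a headache", "reason"),
]
for domain, train, test in domain_folds(questions):
    print(domain, len(train), len(test))
```

Under this reading, a feature set that scores well on held-out domains is one whose usefulness does not depend on vocabulary specific to the training domains.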

Cross-validation of machine learning algorithms for malware detection using static features of Windows portable executables: A Comparative Study

2020 IEEE 17th International Conference on Smart Communities: Improving Quality of Life Using ICT, IoT and AI (HONET), 2020
DOI: 10.1109/honet50430.2020.9322809

Author(s): Warda Aslam, M. M. Fraz, S.K. Rizvi, S. Saleem

Keyword(s): Machine Learning, Comparative Study, Cross Validation, Learning Algorithms, Malware Detection, Machine Learning Algorithms

Identifying the Main Risk Factors for CVD Prediction Using Machine Learning Algorithms

2021
DOI: 10.20944/preprints202108.0471.v1

Author(s): Luis Rolando Guarneros-Nolasco, Nancy Aracely Cruz-Ramos, Giner Alor-Hernández, Lisbeth Rodríguez-Mazahua, José Luis Sánchez-Cervantes

Keyword(s): Machine Learning, Cross Validation, Performance Metrics, Learning Algorithms, Predictive Performance, Machine Learning Algorithms, Algorithm Performance, Body Regions, Risks Factors, Fold Cross Validation

Cardiovascular diseases (CVDs) are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. Since effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be used effectively and reliably to distinguish patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics (accuracy, precision, recall, F1-score, and ROC-AUC) using the train-test split technique and k-fold cross-validation. Our study identifies the top two and top four attributes from each CVD diagnosis/prediction dataset. As our main finding, the ten MLAs exhibited appropriate diagnostic and predictive performance; hence, they can be successfully implemented to improve current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is lacking.
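The five performance metrics named in the abstract have standard definitions for binary labels, which a short sketch can make concrete. The function names, the labels, the scores and the 0.5 decision threshold below are assumptions for illustration and are not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and model scores; predictions use an assumed 0.5 threshold
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]
print(classification_metrics(y_true, y_pred), roc_auc(y_true, scores))
```

The first four metrics depend on the chosen threshold, whereas ROC-AUC is computed from the ranking of the scores alone.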

Detecting Spam Messages in Twitter Data by Machine learning Algorithms using Cross Validation

International Journal of Innovative Technology and Exploring Engineering - Special Issue, 2019, Vol 8 (12), pp. 2941-2946
DOI: 10.35940/ijitee.k1913.1081219

Keyword(s): Machine Learning, Social Media, Cross Validation, Learning Algorithms, Machine Learning Algorithms, Support Vector, Human Relations, Detection Model, Social Media Networks, Twitter Data

Nowadays, human relations are maintained through social media networks, and traditional relationships are becoming obsolete. To stay in contact, share ideas and exchange knowledge, we use social media networking sites such as Twitter, Facebook and LinkedIn. On Twitter, users share their opinions, interests and knowledge with others through messages. At the same time, some users mislead genuine users. The genuine users are also called solicited users, and the users who mislead them are called spammers. Spammers post unwanted information to non-spam users, who may retweet it to others and follow the spammers. To avoid these spam messages, we propose a methodology using machine learning algorithms. Our approach uses a set of content-based features. In the spam detection model we use the support vector machine (SVM) algorithm and the Naive Bayes classification algorithm. To measure the performance of our model we use precision, recall and F-measure metrics.
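As a rough illustration of the Naive Bayes side of such a detection model, the sketch below trains a multinomial Naive Bayes classifier on word counts. The abstract does not list its content-based features, so plain word counts are a stand-in, and the class name, messages and labels are invented.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, text):
        def log_posterior(c):
            total = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for w in text.lower().split():
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the product
                score += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.classes, key=log_posterior)

# Invented toy messages; real tweets would need tokenization and richer features
texts = ["win a free prize now", "free money click now",
         "lunch at noon tomorrow", "see you at the meeting"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("free prize click"), model.predict("see you at lunch"))
```

Predictions from a model like this on a labelled test set are what the precision, recall and F-measure figures would be computed from.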

Patient Survival Prediction with Machine Learning Algorithms

Journal of Intelligent Systems with Applications, 2020, pp. 93-96
DOI: 10.54856/jiswa.202012126

Author(s): Mustafa Berkant Selek, Saadet Sena Egeli, Yalcin Isler

Keyword(s): Machine Learning, Intensive Care, Cross Validation, Patient Survival, Learning Algorithms, Machine Learning Algorithms, Survival Prediction, Accuracy Rate, Intensive Care Patients, One Year

In this study, intensive care unit patient survival is predicted by machine learning algorithms according to the examinations performed in the first 24 hours. Data on intensive care patients collected from approximately two hundred hospitals over a period of one year were used. The algorithms were run in a Python environment. Machine learning models were compared using the cross-validation method, and the random forest algorithm was used. The model made its predictions with an accuracy rate of 92.53%.

Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study

Sao Paulo Medical Journal, 2017, Vol 135 (3), pp. 234-246
DOI: 10.1590/1516-3180.2016.0309010217
Cited By ~ 20

Author(s): André Rodrigues Olivera, Valter Roesler, Cirano Iochpe, Maria Inês Schmidt, Álvaro Vigo, ...

Keyword(s): Machine Learning, Logistic Regression, Predictive Models, Cross Validation, Learning Algorithms, Machine Learning Algorithms, Undiagnosed Diabetes, Generalization Testing, Tenfold Cross Validation, Using Data

ABSTRACT

CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task.

DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil.

METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest.

RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step.

CONCLUSION: Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with highest probability of having undiagnosed diabetes, through easily-obtained clinical data.
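Step (ii) of the methods, forward selection used as a wrapper, can be sketched as a greedy loop around any scoring function. In the study the score of each subset came from repeated tenfold cross-validation; here a made-up scoring function stands in for it, and the variable names and usefulness values are invented rather than taken from the 27 candidate variables.

```python
def forward_selection(candidates, score, min_gain=1e-9):
    """Greedily add the variable that most improves `score(subset)`; stop when none helps."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [v]), v) for v in remaining]
        top_score, top_var = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break  # no remaining variable improves the score
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected, best

# Hypothetical stand-in for a cross-validated score: invented per-variable
# usefulness values minus a small penalty for every variable included
usefulness = {"age": 0.10, "bmi": 0.08, "waist": 0.02, "noise": 0.0}

def toy_score(subset):
    return 0.5 + sum(usefulness[v] for v in subset) - 0.03 * len(subset)

print(forward_selection(list(usefulness), toy_score))
```

Because the score is re-evaluated for every candidate at every step, a wrapper of this kind is costly when each evaluation is itself a repeated cross-validation.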

Blocked 3×2 Cross-Validated t-Test for Comparing Supervised Classification Learning Algorithms

Neural Computation, 2014, Vol 26 (1), pp. 208-235
DOI: 10.1162/neco_a_00532
Cited By ~ 13

Author(s): Wang Yu, Wang Ruibo, Jia Huichen, Li Jihong

Keyword(s): Machine Learning, Supervised Classification, Cross Validation, Learning Algorithms, Machine Learning Algorithms, Variance Estimator, Generalization Error, Classification Learning, Test Of Significance, Classification Tasks

In machine learning research on classification tasks, comparing the performance of algorithms is extremely important, and a statistical test of significance for the generalization error is often used for this purpose in the literature. In view of the randomness of partitions in cross-validation, this letter proposes a new blocked 3×2 cross-validation to estimate the generalization error. We then conduct an analysis of variance of the blocked 3×2 cross-validated estimator. We put forward a relatively conservative variance estimator that accounts for the correlation between any two two-fold cross-validations, which was previously neglected in the 5×2 cross-validated t- and F-tests. A corresponding test using this variance estimator is presented to compare the performance of algorithms. Simulation results show that the performance of our test is comparable with that of the 5×2 cross-validated tests but with lower computational complexity.
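For context, the 5×2 cross-validated paired t-test that the abstract uses as its point of comparison can be written in a few lines. This sketch implements that baseline (Dietterich's statistic) and not the blocked 3×2 test proposed in the letter, whose variance estimator the abstract does not give. The function name and the error-rate differences are invented.

```python
import math

def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    `diffs` holds five (p1, p2) pairs: the difference in error rate between
    two algorithms on each half of five independent 2-fold cross-validations.
    """
    if len(diffs) != 5:
        raise ValueError("expected five (p1, p2) pairs")
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    pooled = sum(variances) / 5
    if pooled == 0:
        raise ValueError("zero variance: the statistic is undefined")
    # The numerator uses only the first difference of the first replication
    return diffs[0][0] / math.sqrt(pooled)

# Invented error-rate differences, for illustration only
diffs = [(0.02, 0.04), (0.01, 0.03), (0.03, 0.02), (0.02, 0.05), (0.00, 0.03)]
print(five_by_two_cv_t(diffs))
```

The statistic is compared against a t distribution with 5 degrees of freedom. Each replication's variance is estimated from its own two folds only, which is where the correlation between different two-fold cross-validations goes unaccounted for.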
