Evaluating explorative prediction power of machine learning algorithms for materials discovery using k-fold forward cross-validation

2020 ◽  
Vol 171 ◽  
pp. 109203 ◽  
Author(s):  
Zheng Xiong ◽  
Yuxin Cui ◽  
Zhonghao Liu ◽  
Yong Zhao ◽  
Ming Hu ◽  
...  
Author(s):  
Peter T. Habib ◽  
Alsamman M. Alsamman ◽  
Sameh E. Hassnein ◽  
Ghada A. Shereif ◽  
Aladdin Hamwieh

Abstractin 2019, estimated New Cases 268.600, Breast cancer has one of the most common cancers and is one of the world’s leading causes of death for women. Classification and data mining is an efficient way to classify information. Particularly in the medical field where prediction techniques are commonly used for early detection and effective treatment in diagnosis and research.These paper tests models for the mammogram analysis of breast cancer information from 23 of the more widely used machine learning algorithms such as Decision Tree, Random forest, K-nearest neighbors and support vector machine. The spontaneously splits results are distributed from a replicated 10-fold cross-validation method. The accuracy calculated by Regression Metrics such as Mean Absolute Error, Mean Squared Error, R2 Score and Clustering Metrics such as Adjusted Rand Index, Homogeneity, V-measure.accuracy has been checked F-Measure, AUC, and Cross-Validation. Thus, proper identification of patients with breast cancer would create care opportunities, for example, the supervision and the implementation of intervention plans could benefit the quality of long-term care. Experimental results reveal that the maximum precision 100%with the lowest error rate is obtained with Ada-boost Classifier.


Symmetry ◽  
2020 ◽  
Vol 12 (3) ◽  
pp. 431 ◽  
Author(s):  
Tomislav Horvat ◽  
Ladislav Havaš ◽  
Dunja Srpak

Interest in sports predictions as well as the public availability of large amounts of structured and unstructured data are increasing every day. As sporting events are not completely independent events, but characterized by the influence of the human factor, the adequate selection of the analysis process is very important. In this paper, seven different classification machine learning algorithms are used and validated with two validation methods: Train&Test and cross-validation. Validation methods were analyzed and critically reviewed. The obtained results are analyzed and compared. Analyzing the results of the used machine learning algorithms, the best average prediction results were obtained by using the nearest neighbors algorithm and the worst prediction results were obtained by using decision trees. The cross-validation method obtained better results than the Train&Test validation method. The prediction results of the Train&Test validation method by using disjoint datasets and up-to-date data were also compared. Better results were obtained by using up-to-date data. In addition, directions for future research are also explained.


Terminology ◽  
2004 ◽  
Vol 10 (1) ◽  
pp. 153-168 ◽  
Author(s):  
Noriko Tomuro

Question terminology is a set of terms which appear in keywords, idioms and fixed expressions commonly observed in questions. This paper investigates ways to automatically extract question terminology from a corpus of questions and represent them for the purpose of classifying by question type. Our key interest is to see whether or not semantic features can enhance the representation of strongly lexical nature of question sentences. We compare two feature sets: one with lexical features only, and another with a mixture of lexical and semantic features. For evaluation, we measure the classification accuracy made by two machine learning algorithms, C5.0 and PEBLS, by using a procedure called domain cross-validation, which effectively measures the domain transferability of features.


Author(s):  
Luis Rolando Guarneros-Nolasco ◽  
Nancy Aracely Cruz-Ramos ◽  
Giner Alor-Hernández ◽  
Lisbeth Rodríguez-Mazahua ◽  
José Luis Sánchez-Cervantes

CVDs are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. Since effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be effectively and reliably used to discern patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risks factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on top-two and top-four dataset attributes/features with respect to five performance metrics –accuracy, precision, recall, f1-score, and roc-auc – using the train-test split technique and k-fold cross-validation. Our study identifies the top two and four attributes from each CVD diagnosis/prediction dataset. As our main findings, the ten MLAs exhibited appropriate diagnosis and predictive performance; hence, they can be successfully implemented for improving current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is lacking.


Now a day’s human relations are maintained by social media networks. Traditional relationships now days are obsolete. To maintain in association, sharing ideas, exchange knowledge between we use social media networking sites. Social media networking sites like Twitter, Facebook, LinkedIn etc are available in the communication environment. Through Twitter media users share their opinions, interests, knowledge to others by messages. At the same time some of the user’s misguide the genuine users. These genuine users are also called solicited users and the users who misguidance are called spammers. These spammers post unwanted information to the non spam users. The non spammers may retweet them to others and they follow the spammers. To avoid this spam messages we propose a methodology by us using machine learning algorithms. To develop our approach used a set of content based features. In spam detection model we used Support vector machine algorithm(SVM) and Naive bayes classification algorithm. To measure the performance of our model we used precision, recall and F measure metrics.


Author(s):  
Mustafa Berkant Selek ◽  
Saadet Sena Egeli ◽  
Yalcin Isler

In this study, the intensive care unit patient survival is predicted by machine learning algorithms according to the examinations performed in the first 24 hours. The data of intensive care patients collected from approximately two hundred hospitals over a period of one year were used. Algorithms are run in Python environment. Machine learning models were compared with the Cross-Validation method, and the random forest algorithm is used. The model made the prediction with 92,53% accuracy rate.


2017 ◽  
Vol 135 (3) ◽  
pp. 234-246 ◽  
Author(s):  
André Rodrigues Olivera ◽  
Valter Roesler ◽  
Cirano Iochpe ◽  
Maria Inês Schmidt ◽  
Álvaro Vigo ◽  
...  

ABSTRACT CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. CONCLUSION: Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with highest probability of having undiagnosed diabetes, through easily-obtained clinical data.


2014 ◽  
Vol 26 (1) ◽  
pp. 208-235 ◽  
Author(s):  
Wang Yu ◽  
Wang Ruibo ◽  
Jia Huichen ◽  
Li Jihong

In the research of machine learning algorithms for classification tasks, the comparison of the performances of algorithms is extremely important, and a statistical test of significance for generalization error is often used to perform it in the machine learning literature. In view of the randomness of partitions in cross-validation, a new blocked 3×2 cross-validation is proposed to estimate generalization error in this letter. We then conduct an analysis of variance of the blocked 3×2 cross-validated estimator. A relatively conservative variance estimator that considers the correlation between any two two-fold cross-validations, and was previously neglected in 5×2 cross-validated t and F-tests is put forward. A corresponding test using this variance estimator is presented to compare the performances of algorithms. Simulated results show that the performance of our test is comparable with that of 5×2 cross-validated tests but with less computation complexity.


Sign in / Sign up

Export Citation Format

Share Document