scholarly journals Distributed collaborative prediction: Results of the machine learning contest

2017 ◽  
Vol 36 (3) ◽  
pp. 267-269 ◽  
Author(s):  
Matt Hall ◽  
Brendon Hall

The Geophysical Tutorial in the October issue of The Leading Edge was the first we've done on the topic of machine learning. Brendon Hall's article ( Hall, 2016 ) showed readers how to take a small data set — wireline logs and geologic facies data from nine wells in the Hugoton natural gas and helium field of southwest Kansas ( Dubois et al., 2007 ) — and predict the facies in two wells for which the facies data were not available. The article demonstrated with 25 lines of code how to explore the data set, then create, train and test a machine learning model for facies classification, and finally visualize the results. The workflow took a deliberately naive approach using a support vector machine model. It achieved a sort of baseline accuracy rate — a first-order prediction, if you will — of 0.42. That might sound low, but it's not untypical for a naive approach to this kind of problem. For comparison, random draws from the facies distribution score 0.16, which is therefore the true baseline.

2021 ◽  
Vol 6 ◽  
Author(s):  
Haiwen Ge ◽  
Ahmad Hadi Bakir ◽  
Suren Yadav ◽  
Yunseon Kang ◽  
Siva Parameswaran ◽  
...  

In the present paper, an efficient optimization method based on Bayesian updating strategy is developed for the design of a spark-ignition engine equipped with pre-chamber. 3D computational fluid dynamics (CFD) simulation coupled with strategies including design of experiment, genetic algorithm, and machine learning methods is used to optimize the pre-chamber with desired combustion phasing. The optimization process starts from a design of experiment matrix of 11 design parameters, which are used to analytically characterize the pre-chamber geometry and set up the 3D combustion CFD. Taking CA50 as the single objective, the CFD results are then used to train the machine learning models. Different machine learning models are evaluated based on their Root Mean Square Error. Five machine learning models from five different categories are selected for second round evaluation. The trained machine learning model is used in the genetic algorithm optimization, which yields the optimized configuration and is again justified by CFD. The new CFD results based on the optimized design are added into the database to further refine the machine learning model. After 24 iterations for each selected machine learning models, the medium Gaussian support vector machine model is identified as the best method for the present application. Iterations using the medium Gaussian support vector machine model continue until a satisfactory result is achieved. Detailed combustion analysis is conducted to investigate the physical mechanism about how the design of pre-chamber influences the engine's performance. It is found that larger volume of the upper part of the pre-chamber results in stronger jet flow and turbulent intensity which further accelerates the flame propagation inside the pre-chamber, dominating the contrary effects from reduced pressure and temperature. Regression analysis shows that the radius of the pre-chamber is the most influential design parameter. The current work not only sheds light on the optimization of engine design, but also has demonstrated a general strategy applicable to the purpose of arbitrary engine optimization and mechanical system design.


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Taking advantage of finding non-intuitive regularities on high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM) were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods such as yrandomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is full of help for designing novel thrombin inhibitors.


2019 ◽  
Vol 15 (3) ◽  
pp. 206-211 ◽  
Author(s):  
Jihui Tang ◽  
Jie Ning ◽  
Xiaoyan Liu ◽  
Baoming Wu ◽  
Rongfeng Hu

<P>Introduction: Machine Learning is a useful tool for the prediction of cell-penetration compounds as drug candidates. </P><P> Materials and Methods: In this study, we developed a novel method for predicting Cell-Penetrating Peptides (CPPs) membrane penetrating capability. For this, we used orthogonal encoding to encode amino acid and each amino acid position as one variable. Then a software of IBM spss modeler and a dataset including 533 CPPs, were used for model screening. </P><P> Results: The results indicated that the machine learning model of Support Vector Machine (SVM) was suitable for predicting membrane penetrating capability. For improvement, the three CPPs with the most longer lengths were used to predict CPPs. The penetration capability can be predicted with an accuracy of close to 95%. </P><P> Conclusion: All the results indicated that by using amino acid position as a variable can be a perspective method for predicting CPPs membrane penetrating capability.</P>


Author(s):  
Dhilsath Fathima.M ◽  
S. Justin Samuel ◽  
R. Hari Haran

Aim: This proposed work is used to develop an improved and robust machine learning model for predicting Myocardial Infarction (MI) could have substantial clinical impact. Objectives: This paper explains how to build machine learning based computer-aided analysis system for an early and accurate prediction of Myocardial Infarction (MI) which utilizes framingham heart study dataset for validation and evaluation. This proposed computer-aided analysis model will support medical professionals to predict myocardial infarction proficiently. Methods: The proposed model utilize the mean imputation to remove the missing values from the data set, then applied principal component analysis to extract the optimal features from the data set to enhance the performance of the classifiers. After PCA, the reduced features are partitioned into training dataset and testing dataset where 70% of the training dataset are given as an input to the four well-liked classifiers as support vector machine, k-nearest neighbor, logistic regression and decision tree to train the classifiers and 30% of test dataset is used to evaluate an output of machine learning model using performance metrics as confusion matrix, classifier accuracy, precision, sensitivity, F1-score, AUC-ROC curve. Results: Output of the classifiers are evaluated using performance measures and we observed that logistic regression provides high accuracy than K-NN, SVM, decision tree classifiers and PCA performs sound as a good feature extraction method to enhance the performance of proposed model. From these analyses, we conclude that logistic regression having good mean accuracy level and standard deviation accuracy compared with the other three algorithms. AUC-ROC curve of the proposed classifiers is analyzed from the output figure.4, figure.5 that logistic regression exhibits good AUC-ROC score, i.e. around 70% compared to k-NN and decision tree algorithm. Conclusion: From the result analysis, we infer that this proposed machine learning model will act as an optimal decision making system to predict the acute myocardial infarction at an early stage than an existing machine learning based prediction models and it is capable to predict the presence of an acute myocardial Infarction with human using the heart disease risk factors, in order to decide when to start lifestyle modification and medical treatment to prevent the heart disease.


2021 ◽  
Author(s):  
Junjie Shi ◽  
Jiang Bian ◽  
Jakob Richter ◽  
Kuan-Hsun Chen ◽  
Jörg Rahnenführer ◽  
...  

AbstractThe predictive performance of a machine learning model highly depends on the corresponding hyper-parameter setting. Hence, hyper-parameter tuning is often indispensable. Normally such tuning requires the dedicated machine learning model to be trained and evaluated on centralized data to obtain a performance estimate. However, in a distributed machine learning scenario, it is not always possible to collect all the data from all nodes due to privacy concerns or storage limitations. Moreover, if data has to be transferred through low bandwidth connections it reduces the time available for tuning. Model-Based Optimization (MBO) is one state-of-the-art method for tuning hyper-parameters but the application on distributed machine learning models or federated learning lacks research. This work proposes a framework $$\textit{MODES}$$ MODES that allows to deploy MBO on resource-constrained distributed embedded systems. Each node trains an individual model based on its local data. The goal is to optimize the combined prediction accuracy. The presented framework offers two optimization modes: (1) $$\textit{MODES}$$ MODES -B considers the whole ensemble as a single black box and optimizes the hyper-parameters of each individual model jointly, and (2) $$\textit{MODES}$$ MODES -I considers all models as clones of the same black box which allows it to efficiently parallelize the optimization in a distributed setting. We evaluate $$\textit{MODES}$$ MODES by conducting experiments on the optimization for the hyper-parameters of a random forest and a multi-layer perceptron. The experimental results demonstrate that, with an improvement in terms of mean accuracy ($$\textit{MODES}$$ MODES -B), run-time efficiency ($$\textit{MODES}$$ MODES -I), and statistical stability for both modes, $$\textit{MODES}$$ MODES outperforms the baseline, i.e., carry out tuning with MBO on each node individually with its local sub-data set.


2020 ◽  
Vol 6 ◽  
Author(s):  
Jaime de Miguel Rodríguez ◽  
Maria Eugenia Villafañe ◽  
Luka Piškorec ◽  
Fernando Sancho Caparrini

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.


2018 ◽  
Vol 34 (3) ◽  
pp. 569-581 ◽  
Author(s):  
Sujata Rani ◽  
Parteek Kumar

Abstract In this article, an innovative approach to perform the sentiment analysis (SA) has been presented. The proposed system handles the issues of Romanized or abbreviated text and spelling variations in the text to perform the sentiment analysis. The training data set of 3,000 movie reviews and tweets has been manually labeled by native speakers of Hindi in three classes, i.e. positive, negative, and neutral. The system uses WEKA (Waikato Environment for Knowledge Analysis) tool to convert these string data into numerical matrices and applies three machine learning techniques, i.e. Naive Bayes (NB), J48, and support vector machine (SVM). The proposed system has been tested on 100 movie reviews and tweets, and it has been observed that SVM has performed best in comparison to other classifiers, and it has an accuracy of 68% for movie reviews and 82% in case of tweets. The results of the proposed system are very promising and can be used in emerging applications like SA of product reviews and social media analysis. Additionally, the proposed system can be used in other cultural/social benefits like predicting/fighting human riots.


2018 ◽  
Author(s):  
Mercy Nyamewaa Asiedu ◽  
Anish Simhal ◽  
Usamah Chaudhary ◽  
Jenna L. Mueller ◽  
Christopher T. Lam ◽  
...  

AbstractGoalIn this work, we propose methods for (1) automatic feature extraction and classification for acetic acid and Lugol’s iodine cervigrams and (2) methods for combining features/diagnosis of different contrasts in cervigrams for improved performance.MethodsWe developed algorithms to pre-process pathology-labeled cervigrams and to extract simple but powerful color and textural-based features. The features were used to train a support vector machine model to classify cervigrams based on corresponding pathology for visual inspection with acetic acid, visual inspection with Lugol’s iodine, and a combination of the two contrasts.ResultsThe proposed framework achieved a sensitivity, specificity, and accuracy of 81.3%, 78.6%, and 80.0%, respectively when used to distinguish cervical intraepithelial neoplasia (CIN+) relative to normal and benign tissues. This is superior to the average values achieved by three expert physicians on the same data set for discriminating normal/benign cases from CIN+ (77% sensitivity, 51% specificity, 63% accuracy).ConclusionThe results suggest that utilizing simple color- and textural-based features from visual inspection with acetic acid and visual inspection with Lugol’s iodine images may provide unbiased automation of cervigrams.SignificanceThis would enable automated, expert-level diagnosis of cervical pre-cancer at the point-of-care.


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Hai-Bang Ly ◽  
Thuy-Anh Nguyen ◽  
Binh Thai Pham

Soil cohesion (C) is one of the critical soil properties and is closely related to basic soil properties such as particle size distribution, pore size, and shear strength. Hence, it is mainly determined by experimental methods. However, the experimental methods are often time-consuming and costly. Therefore, developing an alternative approach based on machine learning (ML) techniques to solve this problem is highly recommended. In this study, machine learning models, namely, support vector machine (SVM), Gaussian regression process (GPR), and random forest (RF), were built based on a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, Vietnam. The database also includes six input parameters, that is, clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. The performance of the model was assessed by three statistical criteria, namely, the correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE). The results demonstrated that the proposed RF model could accurately predict soil cohesion with high accuracy (R = 0.891) and low error (RMSE = 3.323 and MAE = 2.511), and its predictive capability is better than SVM and GPR. Therefore, the RF model can be used as a cost-effective approach in predicting soil cohesion forces used in the design and inspection of constructions.


A large volume of datasets is available in various fields that are stored to be somewhere which is called big data. Big Data healthcare has clinical data set of every patient records in huge amount and they are maintained by Electronic Health Records (EHR). More than 80 % of clinical data is the unstructured format and reposit in hundreds of forms. The challenges and demand for data storage, analysis is to handling large datasets in terms of efficiency and scalability. Hadoop Map reduces framework uses big data to store and operate any kinds of data speedily. It is not solely meant for storage system however conjointly a platform for information storage moreover as processing. It is scalable and fault-tolerant to the systems. Also, the prediction of the data sets is handled by machine learning algorithm. This work focuses on the Extreme Machine Learning algorithm (ELM) that can utilize the optimized way of finding a solution to find disease risk prediction by combining ELM with Cuckoo Search optimization-based Support Vector Machine (CS-SVM). The proposed work also considers the scalability and accuracy of big data models, thus the proposed algorithm greatly achieves the computing work and got good results in performance of both veracity and efficiency.


Sign in / Sign up

Export Citation Format

Share Document