Three machine learning models for the 2019 Solubility Challenge

ADMET & DMPK ◽  
2020 ◽  
Author(s):  
John Mitchell

<p class="ADMETabstracttext">We describe three machine learning models submitted to the 2019 Solubility Challenge. All are founded on tree-like classifiers, with one model being based on Random Forest and another on the related Extra Trees algorithm. The third model is a consensus predictor combining the former two with a Bagging classifier. We call this consensus classifier Vox Machinarum, and here discuss how it benefits from the Wisdom of Crowds. On the first 2019 Solubility Challenge test set of 100 low-variance intrinsic aqueous solubilities, Extra Trees is our best classifier. On the other, a high-variance set of 32 molecules, we find that Vox Machinarum and Random Forest both perform a little better than Extra Trees, and almost as well as one another. We also compare the gold standard solubilities from the 2019 Solubility Challenge with a set of literature-based solubilities for most of the same compounds.</p>
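As a rough illustration of the consensus idea, the sketch below averages the outputs of three predictors molecule by molecule. The abstract does not say how Vox Machinarum combines its members, so the unweighted mean, the function name `consensus_prediction`, and the numbers are assumptions made for illustration only; they are not the author's method or data.

```python
from statistics import mean


def consensus_prediction(member_predictions):
    """Average per-molecule predictions across ensemble members.

    member_predictions: list of equal-length lists, one per model
    (hypothetical stand-ins for Random Forest, Extra Trees and Bagging).
    """
    if not member_predictions:
        raise ValueError("need at least one member model")
    n = len(member_predictions[0])
    if any(len(p) != n for p in member_predictions):
        raise ValueError("all members must predict the same molecules")
    # zip(*...) groups together the predictions made for each molecule
    return [mean(per_molecule) for per_molecule in zip(*member_predictions)]


# Made-up log-solubility values, for illustration only
rf = [-3.0, -1.5, -4.5]
et = [-2.0, -1.0, -5.0]
bag = [-4.0, -2.0, -4.0]
consensus = consensus_prediction([rf, et, bag])
```

Averaging is only one way to form a consensus; a weighted mean or a median would fit the same interface.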

2020 ◽  
Vol 32 ◽  
pp. 03003
Author(s):  
Bhushan Deore ◽  
Aditya Kyatham ◽  
Shubham Narkhede

This paper presents a novel approach to network intrusion detection using machine learning and deep learning. The approach uses two MLP (Multi-Layer Perceptron) models, one with 3 layers and the other with 6 layers, together with a Random Forest classifier. The models are ensembled so that the final accuracy is boosted and the testing time is reduced. Researchers have ensembled multiple models in various ways; we instead use a contradiction management concept. Under contradiction management, if two machine learning models contradict each other in their decisions (in our case the 3-layer MLP and the Random Forest), the decision of a third model (the 6-layer MLP), whose accuracy is higher than that of the other two, is used. The third model is consulted during testing only when the first two disagree, because its more complex architecture makes its testing time higher than theirs. This approach increases the final accuracy through ensembling while also reducing testing time. The novelty of this paper lies in the choice and combination of models for network security.
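The contradiction management rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy rule-based "models", and the feature names `bytes` and `failed_logins` are all hypothetical stand-ins for the trained 3-layer MLP, Random Forest and 6-layer MLP.

```python
def contradiction_managed_predict(sample, fast_a, fast_b, arbiter):
    """Return the shared decision of two fast models, or defer to a third.

    The slower arbiter is only evaluated when the fast models disagree.
    """
    first = fast_a(sample)
    second = fast_b(sample)
    if first == second:
        return first
    return arbiter(sample)


# Toy stand-ins for trained models (hypothetical rules, illustration only)
calls = {"arbiter": 0}


def mlp3(x):
    return "attack" if x["bytes"] > 1000 else "normal"


def forest(x):
    return "attack" if x["failed_logins"] > 3 else "normal"


def mlp6(x):
    calls["arbiter"] += 1  # count how often the slow model is needed
    return "attack" if x["bytes"] > 1000 or x["failed_logins"] > 3 else "normal"


agree = contradiction_managed_predict(
    {"bytes": 200, "failed_logins": 0}, mlp3, forest, mlp6)
disagree = contradiction_managed_predict(
    {"bytes": 5000, "failed_logins": 0}, mlp3, forest, mlp6)
```

In this sketch the arbiter runs once for the two samples, which is where the reduction in testing time would come from when disagreements are rare.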


2020 ◽  
Author(s):  
Shreya Reddy ◽  
Lisa Ewen ◽  
Pankti Patel ◽  
Prerak Patel ◽  
Ankit Kundal ◽  
...  

<p>As bots become more prevalent and smarter in the modern age of the internet, it becomes ever more important that they be identified and removed. Recent research indicates that machine learning methods are accurate and are the gold standard for bot identification on social media. Unfortunately, machine learning models come with drawbacks such as lengthy training times, difficult feature selection, and overwhelming pre-processing tasks. To overcome these difficulties, we propose a blockchain framework for bot identification. At present it is unknown how this method will perform, but it serves to demonstrate the existence of an overwhelming research gap in this area.</p>


2020 ◽  
Vol 22 (10) ◽  
pp. 694-704 ◽  
Author(s):  
Wanben Zhong ◽  
Bineng Zhong ◽  
Hongbo Zhang ◽  
Ziyi Chen ◽  
Yan Chen

Aim and Objective: Cancer is one of the deadliest diseases, taking the lives of millions every year. Traditional methods of treating cancer are expensive and toxic to normal cells. Fortunately, anti-cancer peptides (ACPs) can eliminate this side effect. However, the identification and development of new anti Materials and Methods: In our study, a multi-classifier system combining multiple machine learning models was used to predict anti-cancer peptides. The individual learners are built from different feature information and algorithms, and form a multi-classifier system by voting. Results and Conclusion: The experiments show that the overall prediction rate of each individual learner is above 80% and that the overall accuracy of the multi-classifier system for anti-cancer peptide prediction can reach 95.93%, which is better than the existing prediction model.
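The voting step can be sketched as a simple majority vote over the labels produced by the individual learners. The abstract does not state whether hard or weighted voting is used, so plain majority voting, the function name `majority_vote`, and the labels below are assumptions for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by the most individual learners.

    With an odd number of learners and two classes there are no ties;
    otherwise the first label to reach the top count is returned.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical outputs of three learners for one peptide
decision = majority_vote(["ACP", "non-ACP", "ACP"])
```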


Author(s):  
Farrikh Alzami ◽  
Erika Devi Udayanti ◽  
Dwi Puji Prabowo ◽  
Rama Aria Megantara

Sentiment analysis in the form of polarity classification is very important in everyday life: knowing the polarity, people can tell whether a given document carries positive or negative sentiment, which helps them choose and make decisions. Sentiment analysis is usually done manually, so an automatic sentiment classification process is needed. However, studies that discuss which feature extraction methods and learning models suit unstructured sentiment analysis, with Amazon food reviews as the case, are rare. This research explores feature extraction methods such as Bag of Words, TF-IDF, Word2Vec, and a combination of TF-IDF and Word2Vec, together with several machine learning models such as Random Forest, SVM, KNN and Naïve Bayes, to find a combination of feature extraction and learning model that can help add variety to polarity sentiment analysis. With document preparation such as handling HTML tags, punctuation and special characters, and with Snowball stemming, TF-IDF combined with SVM proved suitable for polarity classification in unstructured sentiment analysis of Amazon food reviews, with a performance of 87.3 percent.
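To make the TF-IDF feature extraction concrete, the sketch below computes the textbook weight, term frequency times log(N / document frequency), for a tiny corpus. It is an illustration under stated assumptions: whitespace tokenization, no stemming, the unsmoothed formula (libraries often use smoothed variants), and invented review texts rather than the Amazon data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented reviews, for illustration only
reviews = ["great tasty coffee", "bad stale coffee"]
weights = tf_idf(reviews)
```

A term that appears in every document, such as "coffee" here, gets weight zero, which is why TF-IDF tends to emphasise the words that distinguish one review from another.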


2021 ◽  
Vol 9 ◽  
Author(s):  
Daniel Lowell Weller ◽  
Tanzy M. T. Love ◽  
Martin Wiedmann

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
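The simplest of the resampling strategies mentioned above, random oversampling, can be sketched as duplicating minority-class samples until the two classes are the same size. This is an illustrative sketch, not the study's pipeline: it assumes binary labels, and the function name and toy data are hypothetical. SMOTE differs in that it synthesises new minority points by interpolating between neighbouring minority samples, which is not shown here.

```python
import random


def random_oversample(features, labels, minority_label, seed=0):
    """Duplicate random minority samples until both classes are equal in size.

    Assumes a binary problem; returns new lists and leaves the inputs untouched.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples with the minority label")
    majority_count = len(labels) - len(minority)
    extra = max(0, majority_count - len(minority))
    # Sample with replacement from the minority class
    picks = [rng.choice(minority) for _ in range(extra)]
    new_features = list(features) + [features[i] for i in picks]
    new_labels = list(labels) + [labels[i] for i in picks]
    return new_features, new_labels


# Toy data: 6 negative and 2 positive water samples (invented values)
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

Resampling should be applied to the training data only, so that duplicated samples do not leak into the evaluation set.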


Geosciences ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 265
Author(s):  
Stefan Rauter ◽  
Franz Tschuchnigg

The classification of soils into categories with a similar range of properties is a fundamental geotechnical engineering procedure. At present, this classification is based on various types of cost- and time-intensive laboratory and/or in situ tests. These soil investigations are essential for each individual construction site and have to be performed prior to the design of a project. Since Machine Learning could play a key role in reducing the costs and time needed for a suitable site investigation program, the basic ability of Machine Learning models to classify soils from Cone Penetration Tests (CPT) is evaluated. To find an appropriate classification model, 24 different Machine Learning models, based on three different algorithms, are built and trained on a dataset consisting of 1339 CPT. The applied algorithms are a Support Vector Machine, an Artificial Neural Network and a Random Forest. As input features, different combinations of direct cone penetration test data (tip resistance qc, sleeve friction fs, friction ratio Rf, depth d), combined with “defined”, thus, not directly measured data (total vertical stresses σv, effective vertical stresses σ’v and hydrostatic pore pressure u0), are used. Standard soil classes based on grain size distributions and soil classes based on soil behavior types according to Robertson are applied as targets. The different models are compared with respect to their prediction performance and the required learning time. The best results for all targets were obtained with models using a Random Forest classifier. For the soil classes based on grain size distribution, an accuracy of about 75%, and for soil classes according to Robertson, an accuracy of about 97–99%, was reached.
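The "defined" input features can be derived from the measured ones with standard textbook relations: friction ratio Rf = 100 · fs / qc, hydrostatic pore pressure u0 = γw · (depth below the water table), and effective stress σ'v = σv − u0. The abstract does not give the formulas or parameters the authors used, so the sketch below is an assumption-laden illustration: a single uniform soil layer, a hypothetical soil unit weight of 19 kN/m³, and invented input values.

```python
GAMMA_WATER = 9.81  # unit weight of water in kN/m^3


def derived_cpt_features(qc_kpa, fs_kpa, depth_m,
                         soil_unit_weight=19.0, water_table_m=0.0):
    """Compute Rf, sigma_v, u0 and sigma'_v for one CPT reading.

    Assumes one uniform soil layer (hypothetical unit weight in kN/m^3)
    and hydrostatic conditions below the water table.
    """
    friction_ratio = 100.0 * fs_kpa / qc_kpa        # Rf in percent
    total_stress = soil_unit_weight * depth_m       # sigma_v in kPa
    pore_pressure = GAMMA_WATER * max(0.0, depth_m - water_table_m)  # u0 in kPa
    effective_stress = total_stress - pore_pressure  # sigma'_v in kPa
    return {
        "Rf": friction_ratio,
        "sigma_v": total_stress,
        "u0": pore_pressure,
        "sigma_v_eff": effective_stress,
    }


# Invented reading: qc = 5 MPa, fs = 50 kPa at 10 m depth, water table at 2 m
row = derived_cpt_features(5000.0, 50.0, 10.0, water_table_m=2.0)
```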


Mekatronika ◽  
2020 ◽  
Vol 2 (1) ◽  
pp. 73-78
Author(s):  
Nur Fahriza Mohd Ali ◽  
Ahmad Farhan Mohd Sadullah ◽  
Anwar P.P. Abdul Majeed ◽  
Mohd Azraai Mohd Razman ◽  
Rabiu Muazu Musa

A door-to-door journey in a public transportation system is a notable concept that is being promoted to encourage users to consider public transport as an important alternative. The door-to-door journey integrates the travel segments from home to destination, including all visible amenities. Users’ preferences regarding the travel time of these key segments need to be understood. Machine learning has been seen as a robust computational approach for forecasting travel mode choice. However, which model is the best predictor remains an open question. To address this issue, we employed several prominent machine learning models, specifically Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR), k-Nearest Neighbor (kNN) and Support Vector Machine (SVM), to compare their performance in predicting the travel mode choice of users in the city of Kuantan. The data were collected in Kuantan City via a Revealed/Stated Preferences (RPSP) Survey between 8:00 AM and 5:00 PM on weekdays. The collected data were split in a ratio of 80:20 for training and testing before the aforesaid models were evaluated. The results showed that Random Forest provided classification accuracies of up to 68.3% and 61.3% on the training and testing data, respectively, which was satisfactory compared with the other evaluated machine learning models. In summary, Random Forest gives good results on the training and testing data and is considered the best predictor in this research for forecasting users’ mode choice in the city of Kuantan.
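The 80:20 split mentioned above can be sketched in a few lines. The abstract does not say whether the records were shuffled or stratified before splitting, so the seeded shuffle, the function name, and the placeholder records are assumptions for illustration.

```python
import random


def train_test_split_80_20(records, seed=42):
    """Shuffle a copy of the records and split it 80% / 20%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Placeholder survey records: integers stand in for respondents
survey = list(range(100))
train, test = train_test_split_80_20(survey)
```

Fixing the seed makes the split reproducible, so that all five models are compared on the same training and testing records.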

