Modeling Traders’ Behavior with Deep Learning and Machine Learning Methods: Evidence from BIST 100 Index

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.
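The trade-off described above (higher sensitivity bought with lower specificity and precision) can be made concrete with a minimal sketch. The confusion counts below are hypothetical and chosen only to mimic an unbalanced dataset; they are not taken from the study.

```python
def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion counts."""
    sensitivity = tp / (tp + fn)   # share of real events that were flagged
    specificity = tn / (tn + fp)   # share of non-events correctly left alone
    precision = tp / (tp + fp)     # share of alarms that were real events
    return sensitivity, specificity, precision

# Hypothetical counts: 30 bloom events and 270 non-events (not from the study).
eager = rates(tp=27, fp=54, tn=216, fn=3)      # flags many events, many false alarms
balanced = rates(tp=21, fp=9, tn=261, fn=9)    # misses more events, far fewer false alarms

print("eager:    sens=%.2f spec=%.2f prec=%.2f" % eager)
print("balanced: sens=%.2f spec=%.2f prec=%.2f" % balanced)
```

With rare events, even a modest false-positive rate produces many false alarms, which is why precision falls much faster than specificity in the first case.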

Detecting a keystone species European aspen in boreal forests with airborne hyperspectral, LiDAR and UAV data with machine learning methods

10.5194/egusphere-egu21-16273 ◽

2021 ◽

Author(s):

Timo Kumpula ◽

Janne Mäyrä ◽

Anton Kuzmin ◽

Arto Viinikka ◽

Sonja Kivinen ◽

...

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Deep Learning ◽

High Resolution ◽

Boreal Forests ◽

Tree Level ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods ◽

European Aspen

Sustainable forest management increasingly highlights the maintenance of biological diversity and requires up-to-date information on the occurrence and distribution of key ecological features in forest environments. Different proxy variables indicating species richness and quality of the sites are essential for efficiently detecting and monitoring forest biodiversity. European aspen (Populus tremula L.) is a minor deciduous tree species with a high importance in maintaining biodiversity in boreal forests. Large aspen trees host hundreds of species, many of them classified as threatened. However, accurate fine-scale spatial data on aspen occurrence remain scarce and incomplete. We studied detection of aspen using different remote sensing techniques in Evo, southern Finland. Our study area of 83 km² contains both managed and protected southern boreal forests characterized by Scots pine (Pinus sylvestris L.), Norway spruce (Picea abies (L.) Karst), and birch (Betula pendula and pubescens L.), whereas European aspen has a relatively sparse and scattered occurrence in the area. We collected high-resolution airborne hyperspectral and airborne laser scanning data covering the whole study area and ultra-high resolution unmanned aerial vehicle (UAV) data with RGB and multispectral sensors from selected parts of the area. We tested the discrimination of aspen from other species at tree level using different machine learning methods (Support Vector Machines, Random Forest, Gradient Boosting Machine) and deep learning methods (3D convolutional neural networks). Airborne hyperspectral and lidar data gave excellent results with machine learning and deep learning classification methods. The highest classification accuracies for aspen varied between 91% and 92% (F1-score). The most important wavelengths for discriminating aspen from other species included reflectance bands of red edge range (724–727 nm) and shortwave infrared (1520–1564 nm and 1684–1706 nm) (Viinikka et al. 2020; Mäyrä et al. 2021). Aspen detection using RGB and multispectral data also gave good results (highest F1-score of aspen = 87%) (Kuzmin et al. 2021). Different remote sensing data enabled production of a spatially explicit map of aspen occurrence in the study area. Information on aspen occurrence and abundance can significantly contribute to biodiversity management and conservation efforts in boreal forests. Our results can be further utilized in upscaling efforts aiming at aspen detection over larger geographical areas using satellite images.

Evaluation of Different Machine Learning Methods and Deep-Learning Convolutional Neural Networks for Landslide Detection

Remote Sensing ◽

10.3390/rs11020196 ◽

2019 ◽

Vol 11 (2) ◽

pp. 196 ◽

Cited By ~ 101

Author(s):

Omid Ghorbanzadeh ◽

Thomas Blaschke ◽

Khalil Gholamnia ◽

Sansar Meena ◽

Dirk Tiede ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Learning ◽

Expert Knowledge ◽

Accuracy Assessment ◽

Window Size ◽

Optical Data ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

There is a growing demand for detailed and accurate landslide maps and inventories around the globe, but particularly in hazard-prone regions such as the Himalayas. Most standard mapping methods require expert knowledge, supervision and fieldwork. In this study, we use optical data from the Rapid Eye satellite and topographic factors to analyze the potential of machine learning methods, i.e., artificial neural network (ANN), support vector machines (SVM) and random forest (RF), and different deep-learning convolution neural networks (CNNs) for landslide detection. We use two training zones and one test zone to independently evaluate the performance of different methods in the highly landslide-prone Rasuwa district in Nepal. Twenty different maps are created using ANN, SVM and RF and different CNN instantiations and are compared against the results of extensive fieldwork through a mean intersection-over-union (mIOU) and other common metrics. This accuracy assessment yields the best result of 78.26% mIOU for a small window size CNN, which uses spectral information only. The additional information from a 5 m digital elevation model helps to discriminate between human settlements and landslides but does not improve the overall classification accuracy. CNNs do not automatically outperform ANN, SVM and RF, although this is sometimes claimed. Rather, the performance of CNNs strongly depends on their design, i.e., layer depth, input window sizes and training strategies. Here, we conclude that the CNN method is still in its infancy as most researchers will either use predefined parameters in solutions like Google TensorFlow or will apply different settings in a trial-and-error manner. Nevertheless, deep-learning can improve landslide mapping in the future if the effects of the different designs are better understood, enough training samples exist, and the effects of augmentation strategies to artificially increase the number of existing samples are better understood.
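The mean intersection-over-union (mIOU) used for the accuracy assessment can be illustrated with a minimal sketch. It assumes binary, flattened label masks (1 = landslide, 0 = background) and averages the per-class IoU over both classes; the exact averaging used in the study may differ, and the toy masks are hypothetical.

```python
def class_iou(pred, truth, label):
    """IoU for one class over two equally long, flattened label masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
    return inter / union if union else None  # None: class absent from both masks

def mean_iou(pred, truth, labels=(0, 1)):
    """Average the per-class IoU, skipping classes absent from both masks."""
    scores = [s for s in (class_iou(pred, truth, c) for c in labels) if s is not None]
    return sum(scores) / len(scores)

# Hypothetical 8-pixel masks (1 = landslide, 0 = background).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]

print("IoU (landslide): %.3f" % class_iou(pred, truth, 1))
print("mIOU:            %.3f" % mean_iou(pred, truth))
```

Because the union grows with every false positive and every false negative, IoU penalizes both over- and under-mapping of landslide pixels.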

Deep learning accurately predicts estrogen receptor status in breast cancer metabolomics data

10.1101/214254 ◽

2017 ◽

Author(s):

Fadhl M Alakwaa ◽

Kumardeep Chaudhary ◽

Lana X Garmire

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Estrogen Receptor ◽

Deep Learning ◽

Support Vector ◽

Integrated Analysis ◽

Learning Method ◽

Learning Methods ◽

Metabolomics Data ◽

Machine Learning Methods

Metabolomics holds promise as a new technology to diagnose highly heterogeneous diseases. Conventionally, metabolomics data analysis for diagnosis is done using various statistical and machine learning based classification methods. However, it remains unknown whether deep neural networks, a class of increasingly popular machine learning methods, are suitable for classifying metabolomics data. Here we use a cohort of 271 breast cancer tissues, 204 estrogen receptor positive (ER+) and 67 estrogen receptor negative (ER-), to test the accuracies of an autoencoder, a deep learning (DL) framework, as well as six widely used machine learning models, namely Random Forest (RF), Support Vector Machines (SVM), Recursive Partitioning and Regression Trees (RPART), Linear Discriminant Analysis (LDA), Prediction Analysis for Microarrays (PAM), and Generalized Boosted Models (GBM). The DL framework has the highest area under the curve (AUC), 0.93, in classifying ER+/ER- patients, compared to the other six machine learning algorithms. Furthermore, the biological interpretation of the first hidden layer reveals eight commonly enriched significant metabolomics pathways (adjusted P-value<0.05) that cannot be discovered by other machine learning methods. Among them, the protein digestion & absorption and ATP-binding cassette (ABC) transporters pathways are also confirmed in an integrated analysis between metabolomics and gene expression data in these samples. In summary, the deep learning method shows advantages for metabolomics based breast cancer ER status classification, with both the highest prediction accuracy (AUC=0.93) and better revelation of disease biology. We encourage the adoption of autoencoder based deep learning methods in the metabolomics research community for classification.
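The AUC reported above can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the scores and labels are hypothetical and unrelated to the study's data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for three ER+ (1) and two ER- (0) samples.
scores = [0.9, 0.8, 0.4, 0.6, 0.3]
labels = [1, 1, 1, 0, 0]

print("AUC = %.3f" % auc(scores, labels))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based formulas give the same value more efficiently on large datasets.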

A stacking ensemble deep learning approach to cancer type classification based on TCGA data

Scientific Reports ◽

10.1038/s41598-021-95128-x ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Mohanad Mohammed ◽

Henry Mwambi ◽

Innocent B. Mboya ◽

Murtada K. Elbashir ◽

Bernard Omolo

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Feature Selection Method ◽

Morphological Characteristics ◽

Support Vector ◽

Cancer Type ◽

Learning Methods ◽

Machine Learning Methods ◽

Proposed Model ◽

Significant Difference

Cancer tumor classification based on morphological characteristics alone has been shown to have serious limitations. Breast, lung, colorectal, thyroid, and ovarian are the most commonly diagnosed cancers among women. Precise classification of cancers into their types is considered a vital problem for cancer diagnosis and therapy. In this paper, we proposed a stacking ensemble deep learning model based on one-dimensional convolutional neural network (1D-CNN) to perform a multi-class classification on the five common cancers among women based on RNASeq data. The RNASeq gene expression data was downloaded from Pan-Cancer Atlas using GDCquery function of the TCGAbiolinks package in the R software. We used least absolute shrinkage and selection operator (LASSO) as feature selection method. We compared the results of the new proposed model with and without LASSO with the results of the single 1D-CNN and machine learning methods which include support vector machines with radial basis function, linear, and polynomial kernels; artificial neural networks; k-nearest neighbors; bagging trees. The results show that the proposed model with and without LASSO has a better performance compared to other classifiers. Also, the results show that the machine learning methods (SVM-R, SVM-L, SVM-P, ANN, KNN, and bagging trees) with under-sampling have better performance than with over-sampling techniques. This is supported by the statistical significance test of accuracy where the p-values for differences between the SVM-R and SVM-P, SVM-R and ANN, SVM-R and KNN are found to be p = 0.003, p < 0.001, and p < 0.001, respectively. Also, SVM-L had a significant difference compared to ANN (p = 0.009). Moreover, SVM-P and ANN, SVM-P and KNN are found to be significantly different with p-values p < 0.001 and p < 0.001, respectively. In addition, ANN and bagging trees, ANN and KNN were found to be significantly different with p-values p < 0.001 and p = 0.004, respectively. Thus, the proposed model can help in the early detection and diagnosis of cancer in women, and hence aid in designing early treatment strategies to improve survival.

A Very Large-Scale Bioactivity Comparison of Deep Learning and Multiple Machine Learning Algorithms for Drug Discovery

10.26434/chemrxiv.12781241 ◽

2020 ◽

Author(s):

Thomas R. Lane ◽

Daniel H. Foil ◽

Eni Minerali ◽

Fabio Urbina ◽

Kimberley M. Zorn ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Learning ◽

Drug Discovery ◽

Deep Neural Networks ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies we have applied multiple machine learning algorithms, modeling metrics and in some cases compared molecular descriptors to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and comparison of our proprietary software Assay Central™ with random forest, k-Nearest Neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (3 levels). Model performance was assessed using an array of five-fold cross-validation metrics including area-under-the-curve, F1 score, Cohen’s kappa and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets all methods appeared comparable while the distance from the top indicated Assay Central™ and support vector classification were comparable. Unlike prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case where minimal tuning was performed of any of the methods. If anything, Assay Central™ may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing over 570,000 unique compounds was based on Assay Central™ performance, but support vector classification seems to be a strong competitor. We also apply Assay Central™ to prospective predictions for PXR and hERG to further validate these models. This work currently appears to be the largest comparison of machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors and algorithms, as well as further refining methods for evaluating and comparing models.
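Two of the metrics named above, Cohen's kappa and the Matthews correlation coefficient (MCC), are less familiar than AUC or F1. The following is a minimal sketch of their standard binary-classification formulas; the confusion counts are hypothetical, and returning 0.0 for degenerate inputs is a convention assumed here, not something stated in the abstract.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 if undefined (assumed convention)

def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predictions and true labels.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

# Hypothetical counts for one dataset (not from the study).
counts = dict(tp=40, fp=10, tn=45, fn=5)
print("MCC   = %.3f" % mcc(**counts))
print("kappa = %.3f" % cohen_kappa(**counts))
```

Both metrics use all four cells of the confusion matrix, which makes them less sensitive to class imbalance than plain accuracy.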

Use of Machine Learning and Deep Learning to Predict the Outcomes of Major League Baseball Matches

Applied Sciences ◽

10.3390/app11104499 ◽

2021 ◽

Vol 11 (10) ◽

pp. 4499

Author(s):

Mei-Ling Huang ◽

Yun-Zhi Li

Keyword(s):

Neural Network ◽

Machine Learning ◽

Feature Selection ◽

Deep Learning ◽

Prediction Accuracy ◽

Major League Baseball ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods ◽

Major League

Major League Baseball (MLB) is the highest level of professional baseball in the world and accounts for some of the most popular international sporting events. Many scholars have conducted research on predicting the outcome of MLB matches. The accuracy in predicting the results of baseball games is low. Therefore, deep learning and machine learning methods were used to build models for predicting the outcomes (win/loss) of MLB matches and investigate the differences between the models in terms of their performance. The match data of 30 teams during the 2019 MLB season with only the starting pitcher or with all pitchers in the pitcher category were collected to compare the prediction accuracy. A one-dimensional convolutional neural network (1DCNN), a traditional machine learning artificial neural network (ANN), and a support vector machine (SVM) were used to predict match outcomes with fivefold cross-validation to evaluate model performance. The highest prediction accuracies were 93.4%, 93.91%, and 93.90% with the 1DCNN, ANN, SVM models, respectively, before feature selection; after feature selection, the highest accuracies obtained were 94.18% and 94.16% with the ANN and SVM models, respectively. The prediction results obtained with the three models were similar, and the prediction accuracies were much higher than those obtained in related studies. Moreover, a 1DCNN was used for the first time for predicting the outcome of MLB matches, and it achieved a prediction accuracy similar to that achieved by machine learning methods.
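The fivefold cross-validation mentioned above splits the data into five disjoint folds and uses each fold once as the test set. A minimal sketch of the index bookkeeping follows; the function name `k_fold_indices`, the shuffling seed, and the sample count are illustrative assumptions, not details from the study.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# Hypothetical dataset of 23 matches; each match is tested exactly once.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(23, k=5), start=1):
    print("fold %d: %d train, %d test" % (fold, len(train_idx), len(test_idx)))
```

A model would be fitted on `train_idx` and scored on `test_idx` in each iteration, and the five scores averaged.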

PREDICTING TIME BEFORE THE NEXT ORDER IN THE ONLINE STORE, BASED ON MACHINE LEARNING METHODS

10.32782/2224-6282/161-27 ◽

2020 ◽

Author(s):

Olena Piskunova ◽

Rostyslav Klochko ◽

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Random Forest ◽

Rapid Development ◽

Confusion Matrix ◽

Individual Characteristics ◽

Support Vector ◽

Learning Methods ◽

Online Store ◽

Machine Learning Methods

Due to the rapid development of e-commerce and increased competition in the retail market of Ukraine, companies are forced to look for new ways to grow their business. One of the options is to optimize business processes, in particular to increase the efficiency of marketing activities. Predicting consumer behavior is one of the most effective methods of optimizing marketing budgets by building processes based on the individual characteristics of each client. The aim of the study was to predict the behavior of online store customers, namely the time before the next order, based on machine learning methods and a comparative analysis of the effectiveness of different modeling algorithms. Five classification algorithms were implemented (linear discriminant analysis, classification and regression trees, random forest, support vector machine, and k-nearest neighbors), and a comparative analysis of their efficiency was performed. Given the peculiarities of customer behavior, for forecasting the time to the next order it is proposed to consider the following future time intervals in which the customer makes the next order: up to two months, two to six months, six to fifteen months, and without order. Predicting such intervals allows us to identify customers who are more likely to make the next purchase and focus our advertising budgets on them, or to build a customer experience management strategy: activate customers who have left and offer discounts to customers who are going to leave. Peculiarities of assessing the quality of classification models on the basis of the “confusion matrix”, using the forecasting accuracy indicators “Accuracy”, “F1”, “Recall” and “Precision”, are considered. The study allowed us to give preference to the “random forest” classification model. Tenfold cross-validation was used to improve the quality of the simulation. The weighted “F1” score in the groups “up to two months” and “two to six months” reached 62.5% and 64.1%, respectively. The developed model should reduce the influence of the human factor on the decision-making process in the construction of marketing strategies.
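The “Accuracy”, “Precision”, “Recall” and “F1” indicators can all be read off a multi-class confusion matrix. The sketch below shows the standard per-class calculation for the four order-time groups; the matrix values are hypothetical and were not taken from the study.

```python
def per_class_scores(matrix, labels):
    """matrix[i][j] = samples of true class i predicted as class j."""
    scores = {}
    for k, name in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                    # rest of the row
        fp = sum(row[k] for row in matrix) - tp     # rest of the column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[name] = (precision, recall, f1)
    return scores

def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[k][k] for k in range(len(matrix))) / total

labels = ["up to two months", "two to six months", "six to fifteen months", "without order"]
# Hypothetical confusion matrix (rows: true group, columns: predicted group).
matrix = [
    [40, 10, 5, 5],
    [10, 50, 10, 10],
    [5, 10, 30, 15],
    [5, 5, 10, 80],
]

print("accuracy = %.3f" % accuracy(matrix))
for name, (p, r, f1) in per_class_scores(matrix, labels).items():
    print("%-22s precision=%.3f recall=%.3f F1=%.3f" % (name, p, r, f1))
```

Reporting F1 per group, as the abstract does, shows which time intervals the classifier separates well even when overall accuracy is moderate.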

Identifying Cancer Targets Based on Machine Learning Methods via Chou’s 5-steps Rule and General Pseudo Components

Current Topics in Medicinal Chemistry ◽

10.2174/1568026619666191016155543 ◽

2019 ◽

Vol 19 (25) ◽

pp. 2301-2317 ◽

Cited By ~ 2

Author(s):

Ruirui Liang ◽

Jiayang Xie ◽

Chi Zhang ◽

Mengying Zhang ◽

Hai Huang ◽

...

Keyword(s):

Machine Learning ◽

Growth Rate ◽

Big Data ◽

Human Genome Project ◽

Genome Project ◽

Support Vector ◽

Successful Implementation ◽

Learning Methods ◽

Machine Learning Methods ◽

Vector Machines

In recent years, the successful implementation of the Human Genome Project has made people realize that genetic, environmental and lifestyle factors should be studied together to understand cancer, owing to the complexity and various forms of the disease. The increasing availability and growth rate of ‘big data’ derived from various omics opens a new window for the study and therapy of cancer. In this paper, we introduce the application of machine learning methods in handling cancer big data, including the use of artificial neural networks, support vector machines, ensemble learning and naïve Bayes classifiers.