Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement

Ernest Kwame Ampomah; Zhiguang Qin; Gabriel Nyame

doi:10.3390/info11060332

Information ◽

10.3390/info11060332 ◽

2020 ◽

Vol 11 (6) ◽

pp. 332

Author(s):

Ernest Kwame Ampomah ◽

Zhiguang Qin ◽

Gabriel Nyame

Keyword(s):

Machine Learning ◽

Stock Market ◽

Stock Price ◽

Superior Performance ◽

Operating Characteristics ◽

Training Set ◽

Data Set ◽

Test Set ◽

Ensemble Machine Learning ◽

Better Than

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries substantial risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models, and ensemble ML models have been shown in the literature to outperform single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight stock data sets from three stock exchanges (NYSE, NASDAQ, and NSE) are randomly collected and used for the study. Each data set is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). Kendall's W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics generated results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
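The ranking step mentioned in the abstract relies on Kendall's coefficient of concordance. The following is a minimal sketch of how W can be computed for untied ranks, assuming each data set acts as a "judge" that ranks the models by one metric. The function name `kendalls_w` and the rank table are illustrative assumptions and are not taken from the paper.

```python
def kendalls_w(rank_table):
    """Kendall's coefficient of concordance W for untied ranks.

    rank_table[i][j] is the rank that judge i (here: one data set)
    assigns to item j (here: one model), with 1 meaning best.
    """
    m = len(rank_table)          # number of judges
    n = len(rank_table[0])       # number of items being ranked
    rank_sums = [sum(row[j] for row in rank_table) for j in range(n)]
    mean_sum = m * (n + 1) / 2   # average of the rank sums
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical ranks of three models on four data sets (not from the paper).
ranks = [
    [1, 2, 3],
    [1, 3, 2],
    [1, 2, 3],
    [2, 1, 3],
]
w = kendalls_w(ranks)
print(w)  # 0.5625
```

W ranges from 0 (no agreement between the data sets) to 1 (identical rankings on every data set). This sketch does not apply the tie correction, so it is only appropriate when no two models share a rank.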

Abstract 349: Prognostication for Out-of-Hospital Cardiogenic Cardiac Arrest Patients Using Advanced Machine Learning Technique

Circulation ◽

10.1161/circ.138.suppl_2.349 ◽

2018 ◽

Vol 138 (Suppl_2) ◽

Author(s):

Tomohisa Seki ◽

Tomoyoshi Tamura ◽

Masaru Suzuki

Keyword(s):

Machine Learning ◽

Cardiac Arrest ◽

Study Data ◽

Machine Learning Techniques ◽

Operating Characteristics ◽

Training Set ◽

Machine Learning Technique ◽

Test Set ◽

Medical Institutions ◽

Learning Technique

Introduction and Objective: Early prognostication for cardiogenic out-of-hospital cardiac arrest (OHCA) patients remains challenging. Recently, advanced machine learning techniques have been employed for clinical diagnosis and prognostication for various conditions. Therefore, in this study, we attempted to establish a prognostication model for cardiogenic OHCA using an advanced machine learning technique. Methods and Results: Data from a prospective multi-center cohort study of OHCA patients transported by ambulance to 67 medical institutions in the Kanto area of Japan between January 2012 and March 2013 were used in this study. Data for cardiogenic OHCA patients aged ≥18 years were retrieved, and patients were grouped according to the time of the call for an ambulance (training set: between January 1, 2012 and December 12, 2012; test set: between January 1, 2013 and March 31, 2013). From among 421 variables observed during the period between the call for an ambulance and initial in-hospital treatment of cardiogenic OHCA, either 38 prehospital factors or 56 prehospital and initial in-hospital factors were used for prognostication. Prognostication models for 1-year survival were established with the random forest method, an advanced machine learning method that aggregates a series of decision trees for classification and regression. After 10-fold internal cross-validation in the training set, the prognostication models were validated using the test set. The area under the receiver operating characteristic curve (AUC) was used to evaluate the prediction performance of the models. Prognostication models trained with 38 variables or 56 variables for 1-year survival showed AUC values of 0.93±0.01 and 0.95±0.01, respectively. Conclusions: Prognostication models trained with an advanced machine learning technique showed favorable prediction capability for 1-year survival of cardiogenic OHCA. These results indicate that an advanced machine learning technique can be applied to establish an early prognostication model for cardiogenic OHCA.
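The AUC used to evaluate these models can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition. The function name `roc_auc` and the labels and scores are made up for illustration and are not study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted survival probabilities for four patients.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The double loop is quadratic in the number of cases, which is fine for a sketch. Rank-based formulas give the same value more efficiently on large test sets.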

Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data

Bioinformatics ◽

10.1093/bioinformatics/btz183 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3989-3995 ◽

Cited By ~ 17

Author(s):

Hongjian Li ◽

Jiangjun Peng ◽

Pavel Sidorov ◽

Yee Leung ◽

Kwong-Sak Leung ◽

...

Keyword(s):

Machine Learning ◽

Protein Structures ◽

Superior Performance ◽

Supplementary Information ◽

Scoring Functions ◽

Training Set ◽

Test Set ◽

Set Size ◽

Extreme Gradient Boosting ◽

The Impact

Motivation: Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when trained only on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes. Results: We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of the SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation: https://github.com/HongjianLi/MLSF Supplementary information: Supplementary data are available at Bioinformatics online.

Evolutionary Learning-Derived Clinical-Radiomic Models for Predicting Early Recurrence of Hepatocellular Carcinoma after Resection

Liver Cancer ◽

10.1159/000518728 ◽

2021 ◽

pp. 1-11

Author(s):

I-Cheng Lee ◽

Jo-Yu Huang ◽

Ting-Chun Chen ◽

Chia-Heng Yen ◽

Nai-Chi Chiu ◽

...

Keyword(s):

Machine Learning ◽

Hepatocellular Carcinoma ◽

Surgical Resection ◽

Clinical Features ◽

Prediction Models ◽

Early Recurrence ◽

Evolutionary Learning ◽

Training Set ◽

Test Set ◽

Better Than

Background and Aims: Current prediction models for early recurrence of hepatocellular carcinoma (HCC) after surgical resection remain unsatisfactory. The aim of this study was to develop interpretable, evolutionary learning-derived prediction models using both clinical and radiomic features to predict early recurrence of HCC after surgical resection. Methods: A total of 517 consecutive HCC patients who underwent surgical resection and had contrast-enhanced computed tomography (CECT) images available before resection were retrospectively enrolled. Patients were randomly assigned to a training set (n = 362) and a test set (n = 155) in a ratio of 7:3. Tumor segmentation of all CECT images, including the noncontrast phase, arterial phase, and portal venous phase, was performed manually for radiomic feature extraction. A novel evolutionary learning-derived method called genetic algorithm for predicting recurrence after surgery of liver cancer (GARSL) was proposed to design prediction models for early recurrence of HCC within 2 years after surgery. Results: A total of 143 features, including 26 preoperative clinical features, 5 postoperative pathological features, and 112 radiomic features, were used to develop the GARSL preoperative and postoperative models. The areas under the receiver operating characteristic curves (AUCs) for early recurrence of HCC within 2 years were 0.781 and 0.767, respectively, in the training set, and 0.739 and 0.741, respectively, in the test set. The accuracy of the GARSL models derived from the evolutionary learning method was significantly better than that of models derived from other well-known machine learning methods and that of the early recurrence after surgery for liver tumor (ERASL) preoperative (AUC = 0.687, p < 0.001 vs. GARSL preoperative) and ERASL postoperative (AUC = 0.688, p < 0.001 vs. GARSL postoperative) models, which use clinical features only. Conclusion: The GARSL models using both clinical and radiomic features significantly improved the accuracy of predicting early recurrence of HCC after surgical resection and were significantly better than other well-known machine learning-derived models and currently available clinical models.
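The 7:3 random assignment described in the Methods can be sketched as follows. This is only an illustration of how 517 patients divide into 362 and 155. The helper name `split_indices`, the seed, and the use of plain shuffling are assumptions, since the abstract does not say how the randomization was implemented.

```python
import random


def split_indices(n, train_fraction=0.7, seed=0):
    """Randomly partition range(n) into training and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    n_train = round(n * train_fraction)    # 517 * 0.7 = 361.9, rounds to 362
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(517)
print(len(train_idx), len(test_idx))  # 362 155
```

Every patient index lands in exactly one of the two lists, so the training and test sets do not overlap.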

Stock Price Pattern Prediction Based on Complex Network and Machine Learning

Complexity ◽

10.1155/2019/4132485 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Hongduo Cao ◽

Tiantian Lin ◽

Ying Li ◽

Hanyu Zhang

Keyword(s):

Machine Learning ◽

Stock Market ◽

Complex Network ◽

Stock Price ◽

Price Volatility ◽

Future Trend ◽

Support Vector ◽

Data Set ◽

Prediction Ability ◽

Pattern Prediction

Complex networks in the stock market and the prediction of stock price volatility patterns are important issues in stock price research. Previous studies have used historical information regarding a single stock to predict the future trend of the stock’s price, seldom considering comovement among stocks in the same market. In this study, in order to extract information about related stocks for prediction, we combine the complex network method with machine learning to predict stock price patterns. First, we propose a new pattern network construction method for multivariate stock time series. The price volatility combination patterns of the Standard & Poor’s 500 Index (S&P 500), the NASDAQ Composite Index (NASDAQ), and the Dow Jones Industrial Average (DJIA) are transformed into directed weighted networks. It is found that network topology characteristics, such as average degree centrality, average strength, average shortest path length, and closeness centrality, can identify periods of sharp fluctuations in the stock market. Next, the topology characteristic variables for each combination symbolic pattern are used as the input variables for K-nearest neighbors (KNN) and support vector machine (SVM) algorithms to predict the next-day volatility patterns of a single stock. The results show that the optimal models corresponding to the two algorithms can be found through cross-validation and search methods, respectively. The prediction accuracy rates for the three indexes on the testing data set are greater than 70%. In general, the prediction ability of SVM algorithms is better than that of KNN algorithms.

Development and validation of prognosis model of mortality risk in patients with COVID-19

Epidemiology and Infection ◽

10.1017/s0950268820001727 ◽

2020 ◽

Vol 148 ◽

Cited By ~ 2

Author(s):

Xuedi Ma ◽

Michael Ng ◽

Shuang Xu ◽

Zhouming Xu ◽

Hui Qiu ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Clinical Features ◽

Mortality Risk ◽

Operating Characteristics ◽

Multivariate Logistic Regression ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods ◽

Better Than

This study aimed to identify clinical features for prognosing mortality risk using machine-learning methods in patients with coronavirus disease 2019 (COVID-19). A retrospective study of inpatients with COVID-19 admitted from 15 January to 15 March 2020 in Wuhan is reported. Data on symptoms, comorbidities, demographics, vital signs, CT scan results and laboratory test results on admission were collected. Machine-learning methods (Random Forest and XGBoost) were used to rank clinical features for mortality risk. Multivariate logistic regression models were applied to identify clinical features with statistical significance. The predictors of mortality were lactate dehydrogenase (LDH), C-reactive protein (CRP) and age, based on 500 bootstrapped samples. A multivariate logistic regression model was formed to predict mortality in 292 in-sample patients, with an area under the receiver operating characteristic curve (AUROC) of 0.9521, which was better than CURB-65 (AUROC of 0.8501) and the machine-learning-based model (AUROC of 0.4530). An out-of-sample data set of 13 patients was further tested to show that our model (AUROC of 0.6061) was also better than CURB-65 (AUROC of 0.4608) and the machine-learning-based model (AUROC of 0.2292). LDH, CRP and age can be used to identify severe patients with COVID-19 on hospital admission.

Abstract 16725: Predicting the Need for a Permanent Pacemaker Device Following Transcatheter Aortic Valve Replacement Using Supervised Machine Learning

Circulation ◽

10.1161/circ.142.suppl_3.16725 ◽

2020 ◽

Vol 142 (Suppl_3) ◽

Author(s):

Wasiq Sheikh ◽

Anshul Parulkar ◽

Malik B Ahmed ◽

Suleman Ilyas ◽

Esseim Sharma ◽

...

Keyword(s):

Machine Learning ◽

Random Forests ◽

Cost Savings ◽

Permanent Pacemaker ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Training Set ◽

Data Set ◽

Test Set ◽

Discriminatory Ability

Introduction: TAVR is approved for use in a range of patients with aortic stenosis. The need for a permanent pacemaker device (PPD) after TAVR varies, ranging from 2 to 51%. Risk factors for requiring PPD after TAVR appear to include being male and having baseline conduction disturbances. Being able to predict who may require PPD could identify at-risk patients early and may confer cost savings. Given the advent of machine learning classifier techniques, random forests may aid in better predicting the need for PPD after TAVR by using pre-operative variables. Hypothesis: Random forests offer discriminatory ability in predicting the need for PPD after TAVR using primarily pre-operative variables. Methods: Pre-operative data from a single institution were collected for patients undergoing TAVR without a history of PPD between January 2016 and December 2019. EKG data were obtained, including underlying rhythm, QRS duration and any underlying conduction abnormality. Other variables included anti-arrhythmic data, comorbidities, and eGFR. Data were imported into Python, and a stratified 5-fold cross-validation, with SMOTE oversampling applied within each fold to avoid overfitting, was run on the training set. The model that optimized the area under the receiver operating characteristic (ROC) curve was exported and applied to a test data set. Precision and recall were calculated to assess classification. Results: A total of 513 patients were identified, with nearly 9% eventually requiring PPD. A total of 40 predictor variables were utilized in the modeling. A stratified split of the data resulted in a training set of 384 patients and a test set of 129 patients. A total of 500 trees were used on the training set. The final optimized model had an area under the ROC curve of 0.71 with the following parameters: Gini criterion, max depth of 4, and logarithm max features. When applied to the test set, the model had an area under the ROC curve of 0.63. Overall accuracy was 0.78, with a precision and recall for no PPD after TAVR of 0.94 and 0.81 and a precision and recall for PPD after TAVR of 0.18 and 0.45. Conclusions: Our results show that machine learning techniques, specifically random forests, have discriminatory ability in predicting PPD after TAVR. More tuning of the models is required to achieve better discrimination.
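To make the reported test-set figures concrete, the sketch below computes accuracy, precision, and recall from a confusion matrix. The abstract does not report the confusion matrix itself: the counts used here (5 true positives, 23 false positives, 6 false negatives, 95 true negatives) are a hypothetical reconstruction, chosen only because they sum to the 129 test patients and reproduce the rounded values above. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Per-class precision and recall plus accuracy from confusion counts.

    The positive class is "PPD required"; the negative class is "no PPD".
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision_ppd": tp / (tp + fp),
        "recall_ppd": tp / (tp + fn),
        "precision_no_ppd": tn / (tn + fn),
        "recall_no_ppd": tn / (tn + fp),
    }


# Hypothetical counts consistent with the abstract, not taken from it.
metrics = binary_metrics(tp=5, fp=23, fn=6, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these assumed counts, only 11 of the 129 test patients are positive, which shows how an accuracy of 0.78 can coexist with a precision of 0.18 for the minority class.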

An Optimized Stacking Ensemble Model for Phishing Websites Detection

Electronics ◽

10.3390/electronics10111285 ◽

2021 ◽

Vol 10 (11) ◽

pp. 1285

Author(s):

Mohammed Al-Sarem ◽

Faisal Saeed ◽

Zeyad Ghaleb Al-Mekhlafi ◽

Badiea Abdulkarem Mohammed ◽

Tawfik Al-Hadhrami ◽

...

Keyword(s):

Machine Learning ◽

Random Forests ◽

Ensemble Method ◽

Detection Methods ◽

Detection Accuracy ◽

Ensemble Model ◽

Security Attacks ◽

Data Set ◽

Machine Learning Methods ◽

Ensemble Machine Learning

Security attacks on legitimate websites to steal users’ information, known as phishing attacks, have been increasing. This kind of attack does not just affect individuals’ or organisations’ websites. Although several detection methods for phishing websites have been proposed using machine learning, deep learning, and other approaches, their detection accuracy still needs to be enhanced. This paper proposes an optimized stacking ensemble method for phishing website detection. The optimisation was carried out using a genetic algorithm (GA) to tune the parameters of several ensemble machine learning methods, including random forests, AdaBoost, XGBoost, Bagging, GradientBoost, and LightGBM. The optimized classifiers were then ranked, and the best three models were chosen as base classifiers of a stacking ensemble method. The experiments were conducted on three phishing website datasets that consisted of both phishing websites and legitimate websites: the Phishing Websites Data Set from UCI (Dataset 1); the Phishing Dataset for Machine Learning from Mendeley (Dataset 2); and Datasets for Phishing Websites Detection from Mendeley (Dataset 3). The experimental results showed an improvement using the optimized stacking ensemble method, where the detection accuracy reached 97.16%, 98.58%, and 97.39% for Dataset 1, Dataset 2, and Dataset 3, respectively.

Stock Market Prediction Using Machine Learning Techniques: A Decade Survey on Methodologies, Recent Developments, and Future Directions

Electronics ◽

10.3390/electronics10212717 ◽

2021 ◽

Vol 10 (21) ◽

pp. 2717

Author(s):

Nusrat Rouf ◽

Majid Bashir Malik ◽

Tasleem Arif ◽

Sparsh Sharma ◽

Saurabh Singh ◽

...

Keyword(s):

Machine Learning ◽

Stock Market ◽

Digital Libraries ◽

Stock Price ◽

Ensemble Methods ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Stock Market Prediction ◽

Research Areas ◽

Recent Developments

With the advent of technological marvels like global digitization, the prediction of the stock market has entered a technologically advanced era, revamping the old model of trading. With the ceaseless increase in market capitalization, stock trading has become a center of investment for many financial investors. Many analysts and researchers have developed tools and techniques that predict stock price movements and help investors in proper decision-making. Advanced trading models enable researchers to predict the market using non-traditional textual data from social platforms. The application of advanced machine learning approaches such as text data analytics and ensemble methods has greatly increased prediction accuracy. Meanwhile, the analysis and prediction of stock markets continue to be one of the most challenging research areas due to dynamic, erratic, and chaotic data. This study explains the systematics of machine learning-based approaches for stock market prediction based on the deployment of a generic framework. Findings from the last decade (2011–2021) were critically analyzed, having been retrieved from online digital libraries and databases like the ACM Digital Library and Scopus. Furthermore, an extensive comparative analysis was carried out to identify the direction of significance. The study should help emerging researchers understand the basics and advancements of this emerging area and thus carry on further research in promising directions.

Stock Market Prediction using Machine Learning

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.35755 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 3628-3632

Author(s):

Prof. Gowrishankar B S

Keyword(s):

Machine Learning ◽

Stock Market ◽

Stock Price ◽

Predictive Analytics ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Stock Market Prediction ◽

The Future ◽

Complicated Model

The stock market is one of the most complicated and sophisticated ways to do business. Small ownerships, brokerage corporations, and banking sectors all depend on it to generate revenue and divide risk, which makes it a very complicated model. This paper proposes using pre-existing machine learning algorithms to predict future stock prices and so make this unpredictable form of business a little more predictable. Machine learning makes predictions from the values of current stock market indices by training on their previous values, and it employs different models to make prediction easier and more reliable. The data has to be cleansed before it can be used for predictions. This paper focuses on categorizing the various methods used to date for predictive analytics in different domains, along with their shortcomings.

Application of Multi-Scale Fusion Attention U-Net to Segment the Thyroid Gland on CT Localization Images for Radiotherapy

10.21203/rs.3.rs-949323/v1 ◽

2021 ◽

Author(s):

Xiaobo Wen ◽

Biao Zhao ◽

Meifang Yuan ◽

Jinzhi Li ◽

Mengzhen Sun ◽

...

Keyword(s):

Thyroid Gland ◽

Clinical Work ◽

Similarity Coefficient ◽

Dice Similarity Coefficient ◽

Training Set ◽

Data Set ◽

Test Set ◽

Noise Interference ◽

Multi Scale ◽

Validation Set

Objectives: To explore the performance of Multi-scale Fusion Attention U-net (MSFA-U-net) in thyroid gland segmentation on CT localization images for radiotherapy. Methods: CT localization images for radiotherapy of 80 patients with breast cancer or head and neck tumors were selected; label images were manually delineated by experienced radiologists. The data set was randomly divided into the training set (n=60), the validation set (n=10), and the test set (n=10). Data expansion was performed in the training set, and the performance of the MSFA-U-net model was evaluated using the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), positive predictive value (PPV), sensitivity (SE), and Hausdorff distance (HD). Results: With the MSFA-U-net model, the DSC, JSC, PPV, SE, and HD of the segmented thyroid gland in the test set were 0.8967±0.0935, 0.8219±0.1115, 0.9065±0.0940, 0.8979±0.1104, and 2.3922±0.5423, respectively. Compared with U-net, HR-net, and Attention U-net, MSFA-U-net increased DSC by 0.052, 0.0376, and 0.0346, respectively; increased JSC by 0.0569, 0.0805, and 0.0433, respectively; increased SE by 0.0361, 0.1091, and 0.0831, respectively; and decreased HD by 0.208, 0.1952, and 0.0548, respectively. The test set image results showed that the thyroid edges segmented by the MSFA-U-net model were closer to the standard thyroid delineated by the experts than those segmented by the other three models. Moreover, the edges were smoother, resistance to noise interference was stronger, and oversegmentation and undersegmentation were reduced. Conclusion: The MSFA-U-net model can meet basic clinical requirements and improve the efficiency of physicians' clinical work.
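The two overlap measures named in the Methods, DSC and JSC, can be sketched for flattened binary masks as follows. The function name `dice_and_jaccard` and the toy masks are illustrative assumptions. The study's own implementation and image data are not described in the abstract.

```python
def dice_and_jaccard(pred, truth):
    """Dice and Jaccard coefficients for two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)
    union = n_pred + n_truth - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect overlap
    dice = 2 * intersection / (n_pred + n_truth)
    jaccard = intersection / union
    return dice, jaccard


# Toy flattened masks (1 = thyroid voxel), made up for illustration.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
dice, jaccard = dice_and_jaccard(pred, truth)
print(round(dice, 4), jaccard)  # 0.6667 0.5
```

For a single pair of masks the two measures are tied by JSC = DSC / (2 − DSC). That identity need not hold between the reported means, because each mean is averaged over test cases separately.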
