Empirical asset pricing via machine learning: evidence from the European stock market

AbstractThis paper evaluates the predictive performance of machine learning methods in forecasting European stock returns. Compared to a linear benchmark model, interactions and nonlinear effects help improve the predictive performance. But machine learning models must be adequately trained and tuned to overcome the high dimensionality problem and to avoid overfitting. Across all machine learning methods, the most important predictors are based on price trends and fundamental signals from valuation ratios. However, the models exhibit substantial variation in statistical predictive performance that translate into pronounced differences in economic profitability. The return and risk measures of long-only trading strategies indicate that machine learning models produce sizeable gains relative to our benchmark. Neural networks perform best, also after accounting for transaction costs. A classification-based portfolio formation, utilizing a support vector machine that avoids estimating stock-level expected returns, performs even better than the neural network architecture.

Improved Interpretability of Machine Learning Model Using Unsupervised Clustering: Predicting Time to First Treatment in Chronic Lymphocytic Leukemia

JCO Clinical Cancer Informatics ◽

10.1200/cci.18.00137 ◽

2019 ◽

pp. 1-11 ◽

Cited By ~ 4

Author(s):

David Chen ◽

Gaurav Goyal ◽

Ronald S. Go ◽

Sameer A. Parikh ◽

Che G. Ngufor

Keyword(s):

Machine Learning ◽

Risk Stratification ◽

Unsupervised Clustering ◽

Support Vector ◽

Learning Models ◽

Learning Methods ◽

Continuous Output ◽

Time To First Treatment ◽

PURPOSE Time to event is an important aspect of clinical decision making. This is particularly true when diseases have highly heterogeneous presentations and prognoses, as in chronic lymphocytic lymphoma (CLL). Although machine learning methods can readily learn complex nonlinear relationships, many methods are criticized as inadequate because of limited interpretability. We propose using unsupervised clustering of the continuous output of machine learning models to provide discrete risk stratification for predicting time to first treatment in a cohort of patients with CLL. PATIENTS AND METHODS A total of 737 treatment-naïve patients with CLL diagnosed at Mayo Clinic were included in this study. We compared predictive abilities for two survival models (Cox proportional hazards and random survival forest) and four classification methods (logistic regression, support vector machines, random forest, and gradient boosting machine). Probability of treatment was then stratified. RESULTS Machine learning methods did not yield significantly more accurate predictions of time to first treatment. However, automated risk stratification provided by clustering was able to better differentiate patients who were at risk for treatment within 1 year than models developed using standard survival analysis techniques. CONCLUSION Clustering the posterior probabilities of machine learning models provides a way to better interpret machine learning models.

Determining hypertensive patients’ beliefs towards medication and associations with medication adherence using machine learning methods

PeerJ ◽

10.7717/peerj.8286 ◽

2020 ◽

Vol 8 ◽

pp. e8286

Author(s):

Firdaus Aziz ◽

Sorayya Malek ◽

Adliah Mhd Ali ◽

Mee Sieng Wong ◽

Mogeeb Mosleh ◽

...

Keyword(s):

Machine Learning ◽

Medication Adherence ◽

Performance Measure ◽

Variable Importance ◽

Support Vector ◽

Learning Models ◽

Learning Methods ◽

Hypertensive Patients ◽

Background This study assesses the feasibility of using machine learning methods such as Random Forests (RF), Artificial Neural Networks (ANN), Support Vector Regression (SVR) and Self-Organizing Feature Maps (SOM) to identify and determine factors associated with hypertensive patients’ adherence levels. Hypertension is the medical term for systolic and diastolic blood pressure higher than 140/90 mmHg. A conventional medication adherence scale was used to identify patients’ adherence to their prescribed medication. Using machine learning applications to predict precise numeric adherence scores in hypertensive patients has not yet been reported in the literature. Methods Data from 160 hypertensive patients from a tertiary hospital in Kuala Lumpur, Malaysia, were used in this study. Variables were ranked based on their significance to adherence levels using the RF variable importance method. The backward elimination method was then performed using RF to obtain the variables significantly associated with the patients’ adherence levels. RF, SVR and ANN models were developed to predict adherence using the identified significant variables. Visualizations of the relationships between hypertensive patients’ adherence levels and variables were generated using SOM. Result Machine learning models constructed using the selected variables reported RMSE values of 1.42 for ANN, 1.53 for RF, and 1.55 for SVR. The accuracy of the dichotomised scores, calculated based on a percentage of correctly identified adherence values, was used as an additional model performance measure, resulting in accuracies of 65% (ANN), 78% (RF) and 79% (SVR), respectively. The Wilcoxon signed ranked test reported that there was no significant difference between the predictions of the machine learning models and the actual scores. The significant variables identified from the RF variable importance method were educational level, marital status, General Overuse, monthly income, and Specific Concern. Conclusion This study suggests an effective alternative to conventional methods in identifying the key variables to understand hypertensive patients’ adherence levels. This can be used as a tool to educate patients on the importance of medication in managing hypertension.

Инновационные аспекты развития науки и техники. Сборник статей VIII Международной научно-практической конференции: сборник статей, [электронное издание сетевого распространения] / Под ред. Н.В. Емельянова. – М.: “КДУ”, “Добросвет”, 2021. – 149 с. ◽

TOPICAL ISSUES OF APPLICATION OF MACHINE LEARNING METHODS IN ECONOMY

10.31453/kdu.ru.978-5-7913-1176-4-2021-28-33 ◽

2021 ◽

Author(s):

Natalia Pavlovna Persteneva ◽

◽

Darya Dmitrievn Skryleva ◽

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Learning Model ◽

Learning Models ◽

Learning Methods ◽

Machine Learning Model ◽

Popular Classes ◽

The article discusses machine learning methods. Using the example of two popular classes: supervised learning and unsupervised learning. Variants of the main types of machine learning models for each method are presented. A generalized algorithm for building any machine learning model is formed.

P1925Machine learning versus classic electrocardiographic criteria for left ventricular hypertrophy in a young pre-participation cohort: results from the SAFE protocol study

European Heart Journal ◽

10.1093/eurheartj/ehz748.0672 ◽

2019 ◽

Vol 40 (Supplement_1) ◽

Author(s):

G Sng ◽

D Y Z Lim ◽

C H Sia ◽

J S W Lee ◽

X Y Shen ◽

...

Keyword(s):

Machine Learning ◽

Left Ventricular Hypertrophy ◽

Ventricular Hypertrophy ◽

Left Ventricular ◽

Superior Performance ◽

Learning Models ◽

Learning Methods ◽

Ecg Criteria ◽

Abstract Background/Introduction Classic electrocardiographic (ECG) criteria for left ventricular hypertrophy (LVH) have been well studied in Western populations, particularly in hypertensive patients. However, their utility in Asian populations is not well studied, and their applicability to young pre-participation cohorts is unclear. We sought to evaluate the performance of classical criteria against that of machine learning models. Aims We sought to evaluate the performance of classical criteria against the performance of novel machine learning models in the identification of LVH. Methodology Between November 2009 and December 2014, pre-participation screening ECG and subsequent echocardiographic data was collected from 13,954 males aged 16 to 22, who reported for medical screening prior to military conscription. Final diagnosis of LVH was made on echocardiography, with LVH defined as a left ventricular mass index >115g/m2. The continuous and binary forms of classical criteria were compared against machine learning models using receiver-operating characteristics (ROC) curve analysis. An 80:20 split was used to divide the data into training and test sets for the machine learning models, and three fold cross validation was used in training the models. We also compared the important variables identified by machine learning models with the input variables of classical criteria. Results Prevalence of echocardiographic LVH in this population was 0.91% (127 cases). Classical ECG criteria had poor performance in predicting LVH, with the best predictions achieved by the continuous Sokolow-Lyon (AUC = 0.63, 95% CI = 0.58–0.68) and the continuous Modified Cornell (AUC = 0.63, 95% CI = 0.58–0.68). Machine learning methods achieved superior performance – Random Forest (AUC = 0.74, 95% CI = 0.66–0.82), Gradient Boosting Machines (AUC = 0.70, 95% CI = 0.61–0.79), GLMNet (AUC = 0.78, 95% CI = 0.70–0.86). Novel and less recognized ECG parameters identified by the machine learning models as being predictive of LVH included mean QT interval, mean QRS interval, R in V4, and R in I. ROC curves of models studies Conclusion The prevalence of LVH in our population is lower than that previously reported in other similar populations. Classical ECG criteria perform poorly in this context. Machine learning methods show superior predictive performance and demonstrate non-traditional predictors of LVH from ECG data. Further research is required to improve the predictive ability of machine learning models, and to understand the underlying pathology of the novel ECG predictors identified.

Field Development Optimization Using Machine Learning Methods to Identify the Optimal Water Flooding Regime

10.2118/206533-ms ◽

2021 ◽

Author(s):

Alexey Vasilievich Timonov ◽

Arturas Rimo Shabonas ◽

Sergey Alexandrovich Schmidt

Keyword(s):

Machine Learning ◽

Optimization Algorithms ◽

Machine Learning Algorithms ◽

Water Flooding ◽

Learning Models ◽

Learning Methods ◽

Injection Wells ◽

Field Development ◽

Abstract The main technology used to optimize field development is hydrodynamic modeling, which is very costly in terms of computing resources and expert time to configure the model. And in the case of brownfields, the complexity increases exponentially. The paper describes the stages of developing a hybrid geological-physical-mathematical proxy model using machine learning methods, which allows performing multivariate calculations and predicting production including various injection well operating regimes. Based on the calculations, we search for the optimal ratio of injection volume distribution to injection wells under given infrastructural constraints. The approach implemented in this work takes into account many factors (some features of the geological structure, history of field development, mutual influence of wells, etc.) and can offer optimal options for distribution of injection volumes of injection wells without performing full-scale or sector hydrodynamic simulation. To predict production, we use machine learning methods (based on decision trees and neural networks) and methods for optimizing the target functions. As a result of this research, a unified algorithm for data verification and preprocessing has been developed for feature extraction tasks and the use of deep machine learning models as input data. Various machine learning algorithms were tested and it was determined that the highest prediction accuracy is achieved by building machine learning models based on Temporal Convolutional Networks (TCN) and gradient boosting. Developed and tested an algorithm for finding the optimal allocation of injection volumes, taking into account the existing infrastructure constraints. Different optimization algorithms are tested. It is determined that the choice and setting of boundary conditions is critical for optimization algorithms in this problem. An integrated approach was tested on terrigenous formations of the West Siberian field, where the developed algorithm showed effectiveness.

Machine learning predictive models of LDL-C in the population of eastern India and its comparison with directly measured and calculated LDL-C

Annals of Clinical Biochemistry International Journal of Laboratory Medicine ◽

10.1177/00045632211046805 ◽

2021 ◽

pp. 000456322110468

Author(s):

Anudeep P P ◽

Suchitra Kumari ◽

Aishvarya S Rajasimman ◽

Saurav Nayak ◽

Pooja Priyadarsini

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Random Forests ◽

Predictive Performance ◽

Support Vector ◽

Learning Models ◽

Complex Interactions ◽

Clinical Biochemistry Laboratory ◽

Study Laboratory ◽

Background LDL-C is a strong risk factor for cardiovascular disorders. The formulas used to calculate LDL-C showed varying performance in different populations. Machine learning models can study complex interactions between the variables and can be used to predict outcomes more accurately. The current study evaluated the predictive performance of three machine learning models—random forests, XGBoost, and support vector Rregression (SVR) to predict LDL-C from total cholesterol, triglyceride, and HDL-C in comparison to linear regression model and some existing formulas for LDL-C calculation, in eastern Indian population. Methods The lipid profiles performed in the clinical biochemistry laboratory of AIIMS Bhubaneswar during 2019–2021, a total of 13,391 samples were included in the study. Laboratory results were collected from the laboratory database. 70% of data were classified as train set and used to develop the three machine learning models and linear regression formula. These models were tested in the rest 30% of the data (test set) for validation. Performance of models was evaluated in comparison to best six existing LDL-C calculating formulas. Results LDL-C predicted by XGBoost and random forests models showed a strong correlation with directly estimated LDL-C (r = 0.98). Two machine learning models performed superior to the six existing and commonly used LDL-C calculating formulas like Friedewald in the study population. When compared in different triglycerides strata also, these two models outperformed the other methods used. Conclusion Machine learning models like XGBoost and random forests can be used to predict LDL-C with more accuracy comparing to conventional linear regression LDL-C formulas.

Combining electronic and structural features in machine learning models to predict organic solar cells properties

Materials Horizons ◽

10.1039/c8mh01135d ◽

2019 ◽

Vol 6 (2) ◽

pp. 343-349 ◽

Cited By ~ 39

Author(s):

Daniele Padula ◽

Jack D. Simpson ◽

Alessandro Troisi

Keyword(s):

Machine Learning ◽

Solar Cells ◽

Organic Solar Cells ◽

Structural Similarity ◽

Structural Features ◽

Learning Models ◽

Learning Methods ◽

Combining electronic and structural similarity between organic donors in kernel based machine learning methods allows to predict photovoltaic efficiencies reliably.

Exploring machine learning methods for absolute configuration determination with vibrational circular dichroism

Physical Chemistry Chemical Physics ◽

10.1039/d1cp02428k ◽

2021 ◽

Vol 23 (35) ◽

pp. 19781-19789

Author(s):

Tom Vermeyen ◽

Jure Brence ◽

Robin Van Echelpoel ◽

Roy Aerts ◽

Guillaume Acke ◽

...

Keyword(s):

Machine Learning ◽

Circular Dichroism ◽

Absolute Configuration ◽

Vibrational Circular Dichroism ◽

Learning Models ◽

Learning Methods ◽

The Absolute ◽

Circular Dichroïsm ◽

The capabilities of machine learning models to extract the absolute configuration of a series of compounds from their vibrational circular dichroism spectra have been demonstrated. The important spectral areas are identified.

Face Recognition System with Machine Learning

Journal of Physics Conference Series ◽

10.1088/1742-6596/2089/1/012047 ◽

2021 ◽

Vol 2089 (1) ◽

pp. 012047

Author(s):

Vuppu Padmakar ◽

B V Ramana Murthy

Keyword(s):

Machine Learning ◽

Face Recognition ◽

Open Source ◽

Programming Language ◽

Recognition System ◽

Learning Models ◽

Learning Methods ◽

Face Recognition System ◽

Abstract This venture plans to give improved security by enabling a client to realize who is actually getting to the framework utilizing facial acknowledgment. The framework enables just approved clients to get entrance. Python is a programming language utilized alongside Machine learning methods and an open source library which is utilized to configuration, construct and train Machine learning models. Interface component is additionally accommodated unapproved clients to enroll to obtain entrance with the earlier authorization from the Admin.

Systematic comparison of five machine-learning methods in classification and interpolation of soil particle size fractions using different transformed data

10.5194/hess-2019-648 ◽

2020 ◽

Author(s):

Mo Zhang ◽

Wenjiao Shi

Keyword(s):

Machine Learning ◽

Soil Texture ◽

Texture Classification ◽

Learning Models ◽

Learning Methods ◽

Particle Size Fractions ◽

Transformation Methods ◽

Log Ratio ◽

Abstract. Soil texture and soil particle size fractions (PSFs) play an increasing role in physical, chemical and hydrological processes. Many previous studies have used machine-learning and log ratio transformation methods for soil texture classification and soil PSFs interpolation to improve the prediction accuracy. However, few reports systematically compared the performance of them in both classification and interpolation. Here, a total of 45 evaluation models generated from five machine-learning models – K-nearest neighbor (KNN), multilayer perceptron neural network (MLP), random forest (RF), support vector machines (SVM), extreme gradient boosting (XGB), combined with original and three log ratio methods – additive log ratio (ALR), centered log ratio (CLR) and isometric log ratio (ILR), were applied to evaluate and compare both of them using 640 soil samples in the Heihe River Basin in China. The results demonstrated that log ratio transformation methods decreased skewness of distributions of soil PSFs data. For soil texture classification, RF and XGB showed better performance with the overall accuracy and kappa coefficients, they were also recommended to evaluate classification capacity of imbalanced data according to the area under the precision-recall curve (AUPRC) analysis. For soil PSFs interpolation, RF delivered the best performance among five machine-learning models with the lowest root mean squared error (RMSE, sand: 15.09 %, silt: 13.86 %, clay: 6.31 %), mean absolute error (MAE, sand: 10.65 %, silt: 9.99 %, clay: 5.00 %), Aitchison distance (AD, 0.84) and standardized residual sum of squares (STRESS, 0.61), and the highest coefficient of determination (R2, sand: 53.28 %, silt: 45.77 %, clay: 53.75 %). STRESS was improved using log ratio methods, especially CLR and ILR. For the comparison of direct and indirect classification, prediction maps were similar on the middle and upper reaches and different on the lower reaches of the HRB. Moreover, indirect classification maps based on log ratio transformed data had more detailed information. There is a pronounced improvement with 21.3 % of kappa coefficient using indirect methods for soil texture classification compared to the direct ones. RF was recommended as the best strategy among these five machine-learning models according to the accuracy evaluation of soil PSFs interpolation and soil texture classification, and ILR was recommended for component-wise machine-learning methods without multivariate treatment considering the constrained nature of compositional data. In addition, XGB was preferred than other models when trade-off of accuracy and time was considered. Our findings can provide a reference for other research of spatial prediction of soil PSFs and texture using machine-learning methods with skewed distribution soil PSFs data in a large area.