Forecasting US movies box office performances in Turkey using machine learning algorithms

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performances and the industry still struggles to predict box office performances in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. From various sources, a dataset of 1559 movies is constructed. Firstly, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristic. The number of attendances is discretized into three classes. Four popular machine learning algorithms, artificial neural networks, decision tree regression and gradient boosting tree and random forest are employed, and the impact of each group is observed by compared by the performance models. Then the number of target classes is increased into five and eight and results are compared with the previously developed models in the literature.

Download Full-text

On the Application of Advanced Machine Learning Methods to Analyze Enhanced, Multimodal Data from Persons Infected with COVID-19

Computation ◽

10.3390/computation9010004 ◽

2021 ◽

Vol 9 (1) ◽

pp. 4

Author(s):

Wenhuan Zeng ◽

Anupam Gautam ◽

Daniel H. Huson

Keyword(s):

Machine Learning ◽

Language Processing ◽

Climatic Factors ◽

Sequence Data ◽

Learning Algorithms ◽

Weather Conditions ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

The Impact

The current COVID-19 pandemic, caused by the rapid worldwide spread of the SARS-CoV-2 virus, is having severe consequences for human health and the world economy. The virus affects different individuals differently, with many infected patients showing only mild symptoms, and others showing critical illness. To lessen the impact of the epidemic, one problem is to determine which factors play an important role in a patient’s progression of the disease. Here, we construct an enhanced COVID-19 structured dataset from more than one source, using natural language processing to add local weather conditions and country-specific research sentiment. The enhanced structured dataset contains 301,363 samples and 43 features, and we applied both machine learning algorithms and deep learning algorithms on it so as to forecast patient’s survival probability. In addition, we import alignment sequence data to improve the performance of the model. Application of Extreme Gradient Boosting (XGBoost) on the enhanced structured dataset achieves 97% accuracy in predicting patient’s survival; with climatic factors, and then age, showing the most importance. Similarly, the application of a Multi-Layer Perceptron (MLP) achieves 98% accuracy. This work suggests that enhancing the available data, mostly basic information on patients, so as to include additional, potentially important features, such as weather conditions, is useful. The explored models suggest that textual weather descriptions can improve outcome forecast.

Download Full-text

Comparative Analysis of Two Machine Learning Algorithms in Predicting Site-Level Net Ecosystem Exchange in Major Biomes

Remote Sensing ◽

10.3390/rs13122242 ◽

2021 ◽

Vol 13 (12) ◽

pp. 2242

Author(s):

Jianzhao Liu ◽

Yunjiang Zuo ◽

Nannan Wang ◽

Fenghui Yuan ◽

Xinhao Zhu ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Substantial Reduction ◽

Model Performance ◽

Extreme Heat ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Ecological Data ◽

Extreme Gradient Boosting ◽

The Impact

The net ecosystem CO2 exchange (NEE) is a critical parameter for quantifying terrestrial ecosystems and their contributions to the ongoing climate change. The accumulation of ecological data is calling for more advanced quantitative approaches for assisting NEE prediction. In this study, we applied two widely used machine learning algorithms, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to build models for simulating NEE in major biomes based on the FLUXNET dataset. Both models accurately predicted NEE in all biomes, while XGBoost had higher computational efficiency (6~62 times faster than RF). Among environmental variables, net solar radiation, soil water content, and soil temperature are the most important variables, while precipitation and wind speed are less important variables in simulating temporal variations of site-level NEE as shown by both models. Both models perform consistently well for extreme climate conditions. Extreme heat and dryness led to much worse model performance in grassland (extreme heat: R2 = 0.66~0.71, normal: R2 = 0.78~0.81; extreme dryness: R2 = 0.14~0.30, normal: R2 = 0.54~0.55), but the impact on forest is less (extreme heat: R2 = 0.50~0.78, normal: R2 = 0.59~0.87; extreme dryness: R2 = 0.86~0.90, normal: R2 = 0.81~0.85). Extreme wet condition did not change model performance in forest ecosystems (with R2 changing −0.03~0.03 compared with normal) but led to substantial reduction in model performance in cropland (with R2 decreasing 0.20~0.27 compared with normal). Extreme cold condition did not lead to much changes in model performance in forest and woody savannas (with R2 decreasing 0.01~0.08 and 0.09 compared with normal, respectively). Our study showed that both models need training samples at daily timesteps of >2.5 years to reach a good model performance and >5.4 years of daily samples to reach an optimal model performance. In summary, both RF and XGBoost are applicable machine learning algorithms for predicting ecosystem NEE, and XGBoost algorithm is more feasible than RF in terms of accuracy and efficiency.

Download Full-text

Predictability of Mortality in Patients With Myocardial Injury After Noncardiac Surgery Based on Perioperative Factors via Machine Learning: Retrospective Study

JMIR Medical Informatics ◽

10.2196/32771 ◽

2021 ◽

Vol 9 (10) ◽

pp. e32771

Author(s):

Seo Jeong Shin ◽

Jungchan Park ◽

Seung-Hwa Lee ◽

Kwangmo Yang ◽

Rae Woong Park

Keyword(s):

Machine Learning ◽

Clinical Data ◽

Myocardial Injury ◽

Learning Algorithms ◽

Noncardiac Surgery ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Data Set ◽

Extreme Gradient Boosting ◽

The Impact

Background Myocardial injury after noncardiac surgery (MINS) is associated with increased postoperative mortality, but the relevant perioperative factors that contribute to the mortality of patients with MINS have not been fully evaluated. Objective To establish a comprehensive body of knowledge relating to patients with MINS, we researched the best performing predictive model based on machine learning algorithms. Methods Using clinical data from 7629 patients with MINS from the clinical data warehouse, we evaluated 8 machine learning algorithms for accuracy, precision, recall, F1 score, area under the receiver operating characteristic (AUROC) curve, and area under the precision-recall curve to investigate the best model for predicting mortality. Feature importance and Shapley Additive Explanations values were analyzed to explain the role of each clinical factor in patients with MINS. Results Extreme gradient boosting outperformed the other models. The model showed an AUROC of 0.923 (95% CI 0.916-0.930). The AUROC of the model did not decrease in the test data set (0.894, 95% CI 0.86-0.922; P=.06). Antiplatelet drugs prescription, elevated C-reactive protein level, and beta blocker prescription were associated with reduced 30-day mortality. Conclusions Predicting the mortality of patients with MINS was shown to be feasible using machine learning. By analyzing the impact of predictors, markers that should be cautiously monitored by clinicians may be identified.

Download Full-text

Predictability of Mortality in Patients with Myocardial Injury after Noncardiac Surgery Based on Perioperative factors via Machine Learning (Preprint)

10.2196/preprints.32771 ◽

2021 ◽

Author(s):

Seo Jeong Shin ◽

Jungchan Park ◽

Seung-Hwa Lee ◽

Kwangmo Yang ◽

Rae Woong Park

Keyword(s):

Machine Learning ◽

Clinical Data ◽

Myocardial Injury ◽

Learning Algorithms ◽

Postoperative Mortality ◽

Noncardiac Surgery ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

The Impact

BACKGROUND Myocardial injury after noncardiac surgery (MINS) is associated with increased postoperative mortality, but the relevant perioperative factors that contribute to the mortality of patients with MINS have not been fully evaluated. OBJECTIVE To establish a comprehensive body of knowledge relating to patients with MINS, we researched the best performing predictive model based on machine learning algorithms. METHODS Using clinical data for 7,629 patients with MINS from the Clinical Data Warehouse, we evaluated eight machine learning algorithms for accuracy, precision, recall, F1 score, AUROC (area under the receiver operating characteristic) curve, and area under the precision-recall curve to investigate the best model for predicting mortality. Feature importance and SHapley Additive exPlanations value were analyzed to explain the role of each clinical factor in patients with MINS. RESULTS Extreme gradient boosting outperformed the other models. The model showed AUROC of 0.923 (95% confidence interval (CI): 0.916–0.930). The AUROC of the model was not decreased in the test dataset (0.894, 95% CI: 0.86–0.922) (P =.06). Antiplatelet drugs prescription, elevated C-reactive protein level, and beta blocker prescription were associated with reduced 30-day mortality. CONCLUSIONS Predicting mortality of patients with MINS was shown to be feasible using machine learning. By analyzing the impact of predictors, markers that should be cautiously monitored by clinicians may be identified.

Download Full-text

Feasibility of Machine Learning Algorithms for Predicting the Deformation of Anodic Titanium Films by Modulating Anodization Processes

Materials ◽

10.3390/ma14051089 ◽

2021 ◽

Vol 14 (5) ◽

pp. 1089

Author(s):

Sung-Hee Kim ◽

Chanyoung Jeong

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Multiclass Classification ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Smart Manufacturing ◽

Gradient Boosting ◽

Experimental Conditions ◽

Learning Techniques ◽

Tio2 Nanostructures

This study aims to demonstrate the feasibility of applying eight machine learning algorithms to predict the classification of the surface characteristics of titanium oxide (TiO2) nanostructures with different anodization processes. We produced a total of 100 samples, and we assessed changes in TiO2 nanostructures’ thicknesses by performing anodization. We successfully grew TiO2 films with different thicknesses by one-step anodization in ethylene glycol containing NH4F and H2O at applied voltage differences ranging from 10 V to 100 V at various anodization durations. We found that the thicknesses of TiO2 nanostructures are dependent on anodization voltages under time differences. Therefore, we tested the feasibility of applying machine learning algorithms to predict the deformation of TiO2. As the characteristics of TiO2 changed based on the different experimental conditions, we classified its surface pore structure into two categories and four groups. For the classification based on granularity, we assessed layer creation, roughness, pore creation, and pore height. We applied eight machine learning techniques to predict classification for binary and multiclass classification. For binary classification, random forest and gradient boosting algorithm had relatively high performance. However, all eight algorithms had scores higher than 0.93, which signifies high prediction on estimating the presence of pore. In contrast, decision tree and three ensemble methods had a relatively higher performance for multiclass classification, with an accuracy rate greater than 0.79. The weakest algorithm used was k-nearest neighbors for both binary and multiclass classifications. We believe that these results show that we can apply machine learning techniques to predict surface quality improvement, leading to smart manufacturing technology to better control color appearance, super-hydrophobicity, super-hydrophilicity or batter efficiency.

Download Full-text

Analyze the impact of the epidemic on New York taxis by machine learning algorithms and recommendations for optimal prediction algorithms

10.1145/3475851.3475861 ◽

2021 ◽

Author(s):

Zheng Liu ◽

Xinjing Xia ◽

Haipeng Zhang ◽

Zihui Xie

Keyword(s):

Machine Learning ◽

New York ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Optimal Prediction ◽

Prediction Algorithms ◽

The Impact

Download Full-text

Predicting hospitalization following psychiatric crisis care using machine learning

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-020-01361-1 ◽

2020 ◽

Vol 20 (1) ◽

Author(s):

Matthijs Blankers ◽

Louk F. M. van der Post ◽

Jack J. M. Dekker

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Prediction Models ◽

Learning Algorithms ◽

Nearest Neighbors ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Ensemble Model ◽

K Nearest Neighbors ◽

Crisis Care

Abstract Background Accurate prediction models for whether patients on the verge of a psychiatric criseis need hospitalization are lacking and machine learning methods may help improve the accuracy of psychiatric hospitalization prediction models. In this paper we evaluate the accuracy of ten machine learning algorithms, including the generalized linear model (GLM/logistic regression) to predict psychiatric hospitalization in the first 12 months after a psychiatric crisis care contact. We also evaluate an ensemble model to optimize the accuracy and we explore individual predictors of hospitalization. Methods Data from 2084 patients included in the longitudinal Amsterdam Study of Acute Psychiatry with at least one reported psychiatric crisis care contact were included. Target variable for the prediction models was whether the patient was hospitalized in the 12 months following inclusion. The predictive power of 39 variables related to patients’ socio-demographics, clinical characteristics and previous mental health care contacts was evaluated. The accuracy and area under the receiver operating characteristic curve (AUC) of the machine learning algorithms were compared and we also estimated the relative importance of each predictor variable. The best and least performing algorithms were compared with GLM/logistic regression using net reclassification improvement analysis and the five best performing algorithms were combined in an ensemble model using stacking. Results All models performed above chance level. We found Gradient Boosting to be the best performing algorithm (AUC = 0.774) and K-Nearest Neighbors to be the least performing (AUC = 0.702). The performance of GLM/logistic regression (AUC = 0.76) was slightly above average among the tested algorithms. In a Net Reclassification Improvement analysis Gradient Boosting outperformed GLM/logistic regression by 2.9% and K-Nearest Neighbors by 11.3%. GLM/logistic regression outperformed K-Nearest Neighbors by 8.7%. Nine of the top-10 most important predictor variables were related to previous mental health care use. Conclusions Gradient Boosting led to the highest predictive accuracy and AUC while GLM/logistic regression performed average among the tested algorithms. Although statistically significant, the magnitude of the differences between the machine learning algorithms was in most cases modest. The results show that a predictive accuracy similar to the best performing model can be achieved when combining multiple algorithms in an ensemble model.

Download Full-text

Predicting Hospitalization following Psychiatric Crisis Care using Machine Learning

10.21203/rs.2.12338/v1 ◽

2019 ◽

Author(s):

Matthijs Blankers ◽

Louk F. M. van der Post ◽

Jack J. M. Dekker

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Learning Algorithms ◽

Nearest Neighbors ◽

Machine Learning Algorithms ◽

Predictor Variables ◽

Gradient Boosting ◽

K Nearest Neighbors ◽

Psychiatric Crisis ◽

Crisis Care

Abstract Background: It is difficult to accurately predict whether a patient on the verge of a potential psychiatric crisis will need to be hospitalized. Machine learning may be helpful to improve the accuracy of psychiatric hospitalization prediction models. In this paper we evaluate and compare the accuracy of ten machine learning algorithms including the commonly used generalized linear model (GLM/logistic regression) to predict psychiatric hospitalization in the first 12 months after a psychiatric crisis care contact, and explore the most important predictor variables of hospitalization. Methods: Data from 2,084 patients with at least one reported psychiatric crisis care contact included in the longitudinal Amsterdam Study of Acute Psychiatry were used. The accuracy and area under the receiver operating characteristic curve (AUC) of the machine learning algorithms were compared. We also estimated the relative importance of each predictor variable. The best and least performing algorithms were compared with GLM/logistic regression using net reclassification improvement analysis. Target variable for the prediction models was whether or not the patient was hospitalized in the 12 months following inclusion in the study. The 39 predictor variables were related to patients’ socio-demographics, clinical characteristics and previous mental health care contacts. Results: We found Gradient Boosting to perform the best (AUC=0.774) and K-Nearest Neighbors performing the least (AUC=0.702). The performance of GLM/logistic regression (AUC=0.76) was above average among the tested algorithms. Gradient Boosting outperformed GLM/logistic regression and K-Nearest Neighbors, and GLM outperformed K-Nearest Neighbors in a Net Reclassification Improvement analysis, although the differences between Gradient Boosting and GLM/logistic regression were small. Nine of the top-10 most important predictor variables were related to previous mental health care use. Conclusions: Gradient Boosting led to the highest predictive accuracy and AUC while GLM/logistic regression performed average among the tested algorithms. Although statistically significant, the magnitude of the differences between the machine learning algorithms was modest. Future studies may consider to combine multiple algorithms in an ensemble model for optimal performance and to mitigate the risk of choosing suboptimal performing algorithms.

Download Full-text

An analytical survey on the role of machine learning algorithms in case of intrusion detection

ACCENTS Transactions on Information Security ◽

10.19101/tis.2020.517002 ◽

2020 ◽

Vol 5 (19) ◽

pp. 32-35

Author(s):

Anand Vijay ◽

Kailash Patidar ◽

Manoj Yadav ◽

Rishi Kushwah

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Detection System ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Handling Mechanism ◽

The Impact

In this paper an analytical survey on the role of machine learning algorithms in case of intrusion detection has been presented and discussed. This paper shows the analytical aspects in the development of efficient intrusion detection system (IDS). The related study for the development of this system has been presented in terms of computational methods. The discussed methods are data mining, artificial intelligence and machine learning. It has been discussed along with the attack parameters and attack types. This paper also elaborates the impact of different attack and handling mechanism based on the previous papers.

Download Full-text

Teleconsultations between Patients and Healthcare Professionals in Primary Care in Catalonia: the Evaluation of Text Classification Algorithms Using Machine Learning

10.20944/preprints201912.0220.v1 ◽

2019 ◽

Author(s):

Francesc López Seguí ◽

Ricardo Ander Egg Aguilar ◽

Gabriel de Maeztu ◽

Anna García-Altés ◽

Francesc García Cuyàs ◽

...

Keyword(s):

Machine Learning ◽

Primary Care ◽

Text Classification ◽

Learning Strategy ◽

Care Service ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Face To Face ◽

Classification Tool ◽

The Impact

Background: the primary care service in Catalonia has operated an asynchronous teleconsulting service between GPs and patients since 2015 (eConsulta), which has generated some 500,000 messages. New developments in big data analysis tools, particularly those involving natural language, can be used to accurately and systematically evaluate the impact of the service. Objective: the study was intended to examine the predictive potential of eConsulta messages through different combinations of vector representation of text and machine learning algorithms and to evaluate their performance. Methodology: 20 machine learning algorithms (based on 5 types of algorithms and 4 text representation techniques)were trained using a sample of 3,559 messages (169,102 words) corresponding to 2,268 teleconsultations (1.57 messages per teleconsultation) in order to predict the three variables of interest (avoiding the need for a face-to-face visit, increased demand and type of use of the teleconsultation). The performance of the various combinations was measured in terms of precision, sensitivity, F-value and the ROC curve. Results: the best-trained algorithms are generally effective, proving themselves to be more robust when approximating the two binary variables "avoiding the need of a face-to-face visit" and "increased demand" (precision = 0.98 and 0.97, respectively) rather than the variable "type of query"(precision = 0.48). Conclusion: to the best of our knowledge, this study is the first to investigate a machine learning strategy for text classification using primary care teleconsultation datasets. The study illustrates the possible capacities of text analysis using artificial intelligence. The development of a robust text classification tool could be feasible by validating it with more data, making it potentially more useful for decision support for health professionals.

Download Full-text