Predicting Spatial Crime Occurrences through an Efficient Ensemble-Learning Model

While the use of crime data has been widely advocated in the literature, its availability is often limited to large urban cities and isolated databases that tend not to allow for spatial comparisons. This paper presents an efficient machine learning framework capable of predicting spatial crime occurrences, without using past crime as a predictor, and at a relatively high resolution: the U.S. Census Block Group level. The proposed framework is based on an in-depth multidisciplinary literature review allowing the selection of 188 best-fit crime predictors from socio-economic, demographic, spatial, and environmental data. Such data are published periodically for the entire United States. The selection of the appropriate predictive model was made through a comparative study of different machine learning families of algorithms, including generalized linear models, deep learning, and ensemble learning. The gradient boosting model was found to yield the most accurate predictions for violent crimes, property crimes, motor vehicle thefts, vandalism, and the total count of crimes. Extensive experiments on real-world datasets of crimes reported in 11 U.S. cities demonstrated that the proposed framework achieves an accuracy of 73% and 77% when predicting property crimes and violent crimes, respectively.

Download Full-text

Utilizing Data-Driven Models to Predict Brittleness in Tuscaloosa Marine Shale: A Machine Learning Approach

10.2118/208628-stu ◽

2021 ◽

Author(s):

Jamal Ahmadov

Keyword(s):

Machine Learning ◽

Random Forest ◽

Brittleness Index ◽

Estimation Methods ◽

Gradient Boosting ◽

Average Error ◽

Support Vector ◽

Marine Shale ◽

Effective Manner ◽

Selection Of

Abstract The Tuscaloosa Marine Shale (TMS) formation is a clay- and liquid-rich emerging shale play across central Louisiana and southwest Mississippi with recoverable resources of 1.5 billion barrels of oil and 4.6 trillion cubic feet of gas. The formation poses numerous challenges due to its high average clay content (50 wt%) and rapidly changing mineralogy, making the selection of fracturing candidates a difficult task. While brittleness plays an important role in screening potential intervals for hydraulic fracturing, typical brittleness estimation methods require the use of geomechanical and mineralogical properties from costly laboratory tests. Machine Learning (ML) can be employed to generate synthetic brittleness logs and therefore, may serve as an inexpensive and fast alternative to the current techniques. In this paper, we propose the use of machine learning to predict the brittleness index of Tuscaloosa Marine Shale from conventional well logs. We trained ML models on a dataset containing conventional and brittleness index logs from 8 wells. The latter were estimated either from geomechanical logs or log-derived mineralogy. Moreover, to ensure mechanical data reliability, dynamic-to-static conversion ratios were applied to Young's modulus and Poisson's ratio. The predictor features included neutron porosity, density and compressional slowness logs to account for the petrophysical and mineralogical character of TMS. The brittleness index was predicted using algorithms such as Linear, Ridge and Lasso Regression, K-Nearest Neighbors, Support Vector Machine (SVM), Decision Tree, Random Forest, AdaBoost and Gradient Boosting. Models were shortlisted based on the Root Mean Square Error (RMSE) value and fine-tuned using the Grid Search method with a specific set of hyperparameters for each model. Overall, Gradient Boosting and Random Forest outperformed other algorithms and showed an average error reduction of 5 %, a normalized RMSE of 0.06 and a R-squared value of 0.89. The Gradient Boosting was chosen to evaluate the test set and successfully predicted the brittleness index with a normalized RMSE of 0.07 and R-squared value of 0.83. This paper presents the practical use of machine learning to evaluate brittleness in a cost and time effective manner and can further provide valuable insights into the optimization of completion in TMS. The proposed ML model can be used as a tool for initial screening of fracturing candidates and selection of fracturing intervals in other clay-rich and heterogeneous shale formations.

Download Full-text

Machine learning algorithms for predicting undernutrition among under-five children in Ethiopia

Public Health Nutrition ◽

10.1017/s1368980021004262 ◽

2021 ◽

pp. 1-29

Author(s):

Fikrewold H. Bitew ◽

Corey S. Sparks ◽

Samuel H. Nyarko

Keyword(s):

Machine Learning ◽

Linear Models ◽

Learning Algorithms ◽

Public Health Problem ◽

Water Source ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Global Public Health ◽

Prediction Ability ◽

Extreme Gradient Boosting

Abstract Objective: Child undernutrition is a global public health problem with serious implications. In this study, estimate predictive algorithms for the determinants of childhood stunting by using various machine learning (ML) algorithms. Design: This study draws on data from the Ethiopian Demographic and Health Survey of 2016. Five machine learning algorithms including eXtreme gradient boosting (xgbTree), k-nearest neighbors (K-NN), random forest (RF), neural network (NNet), and the generalized linear models (GLM) were considered to predict the socio-demographic risk factors for undernutrition in Ethiopia. Setting: Households in Ethiopia. Participants: A total of 9,471 children below five years of age. Results: The descriptive results show substantial regional variations in child stunting, wasting, and underweight in Ethiopia. Also, among the five ML algorithms, xgbTree algorithm shows a better prediction ability than the generalized linear mixed algorithm. The best predicting algorithm (xgbTree) shows diverse important predictors of undernutrition across the three outcomes which include time to water source, anemia history, child age greater than 30 months, small birth size, and maternal underweight, among others. Conclusions: The xgbTree algorithm was a reasonably superior ML algorithm for predicting childhood undernutrition in Ethiopia compared to other ML algorithms considered in this study. The findings support improvement in access to water supply, food security, and fertility regulation among others in the quest to considerably improve childhood nutrition in Ethiopia.

Download Full-text

A Comparative Analysis of Novel Deep Learning and Ensemble Learning Models to Predict the Allergenicity of Food Proteins

Foods ◽

10.3390/foods10040809 ◽

2021 ◽

Vol 10 (4) ◽

pp. 809

Author(s):

Liyang Wang ◽

Dantong Niu ◽

Xinjie Zhao ◽

Xiaoya Wang ◽

Mengzhen Hao ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Ensemble Learning ◽

Food Allergen ◽

Gradient Boosting ◽

Learning Models ◽

Food Proteins ◽

Deep Model ◽

Allergen Identification ◽

Machine Learning Models

Traditional food allergen identification mainly relies on in vivo and in vitro experiments, which often needs a long period and high cost. The artificial intelligence (AI)-driven rapid food allergen identification method has solved the above mentioned some drawbacks and is becoming an efficient auxiliary tool. Aiming to overcome the limitations of lower accuracy of traditional machine learning models in predicting the allergenicity of food proteins, this work proposed to introduce deep learning model—transformer with self-attention mechanism, ensemble learning models (representative as Light Gradient Boosting Machine (LightGBM) eXtreme Gradient Boosting (XGBoost)) to solve the problem. In order to highlight the superiority of the proposed novel method, the study also selected various commonly used machine learning models as the baseline classifiers. The results of 5-fold cross-validation showed that the area under the receiver operating characteristic curve (AUC) of the deep model was the highest (0.9578), which was better than the ensemble learning and baseline algorithms. But the deep model need to be pre-trained, and the training time is the longest. By comparing the characteristics of the transformer model and boosting models, it can be analyzed that, each model has its own advantage, which provides novel clues and inspiration for the rapid prediction of food allergens in the future.

Download Full-text

Machine Learning and Financial Literacy: An Exploration of Factors Influencing Financial Knowledge in Italy

Journal of Risk and Financial Management ◽

10.3390/jrfm14030120 ◽

2021 ◽

Vol 14 (3) ◽

pp. 120

Author(s):

Susanna Levantesi ◽

Giulia Zacchia

Keyword(s):

Machine Learning ◽

Financial Literacy ◽

Linear Models ◽

Financial Knowledge ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Learning Techniques ◽

Valuable Complement ◽

Standard Models ◽

Main Determinants

In recent years, machine learning techniques have assumed an increasingly central role in many areas of research, from computer science to medicine, including finance. In the current study, we applied it to financial literacy to test its accuracy, compared to a standard parametric model, in the estimation of the main determinants of financial knowledge. Using recent data on financial literacy and inclusion among Italian adults, we empirically tested how tree-based machine learning methods, such as decision trees, random, forest and gradient boosting techniques, can be a valuable complement to standard models (generalized linear models) for the identification of the groups in the population in most need of improving their financial knowledge.

Download Full-text

Prediction of Maize Phenotypic Traits With Genomic and Environmental Predictors Using Gradient Boosting Frameworks

Frontiers in Plant Science ◽

10.3389/fpls.2021.699589 ◽

2021 ◽

Vol 12 ◽

Author(s):

Cathy C. Westhues ◽

Gregory S. Mahone ◽

Sofia da Silva ◽

Patrick Thorwarth ◽

Malthe Schmidt ◽

...

Keyword(s):

Machine Learning ◽

Grain Yield ◽

Prediction Models ◽

Predictive Ability ◽

The United States ◽

Environmental Data ◽

Gradient Boosting ◽

Phenotypic Traits ◽

Environmental Predictors ◽

Prediction Problems

The development of crop varieties with stable performance in future environmental conditions represents a critical challenge in the context of climate change. Environmental data collected at the field level, such as soil and climatic information, can be relevant to improve predictive ability in genomic prediction models by describing more precisely genotype-by-environment interactions, which represent a key component of the phenotypic response for complex crop agronomic traits. Modern predictive modeling approaches can efficiently handle various data types and are able to capture complex nonlinear relationships in large datasets. In particular, machine learning techniques have gained substantial interest in recent years. Here we examined the predictive ability of machine learning-based models for two phenotypic traits in maize using data collected by the Maize Genomes to Fields (G2F) Initiative. The data we analyzed consisted of multi-environment trials (METs) dispersed across the United States and Canada from 2014 to 2017. An assortment of soil- and weather-related variables was derived and used in prediction models alongside genotypic data. Linear random effects models were compared to a linear regularized regression method (elastic net) and to two nonlinear gradient boosting methods based on decision tree algorithms (XGBoost, LightGBM). These models were evaluated under four prediction problems: (1) tested and new genotypes in a new year; (2) only unobserved genotypes in a new year; (3) tested and new genotypes in a new site; (4) only unobserved genotypes in a new site. Accuracy in forecasting grain yield performance of new genotypes in a new year was improved by up to 20% over the baseline model by including environmental predictors with gradient boosting methods. For plant height, an enhancement of predictive ability could neither be observed by using machine learning-based methods nor by using detailed environmental information. An investigation of key environmental factors using gradient boosting frameworks also revealed that temperature at flowering stage, frequency and amount of water received during the vegetative and grain filling stage, and soil organic matter content appeared as important predictors for grain yield in our panel of environments.

Download Full-text

Performance and clinical utility of supervised machine-learning approaches in detecting familial hypercholesterolaemia in primary care

npj Digital Medicine ◽

10.1038/s41746-020-00349-5 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Ralph K. Akyea ◽

Nadeem Qureshi ◽

Joe Kai ◽

Stephen F. Weng

Keyword(s):

Machine Learning ◽

Primary Care ◽

Logistic Regression ◽

Heart Disease ◽

Ensemble Learning ◽

Clinical Utility ◽

Familial Hypercholesterolaemia ◽

Predictive Accuracy ◽

Gradient Boosting ◽

Learning Approaches

Abstract Familial hypercholesterolaemia (FH) is a common inherited disorder, causing lifelong elevated low-density lipoprotein cholesterol (LDL-C). Most individuals with FH remain undiagnosed, precluding opportunities to prevent premature heart disease and death. Some machine-learning approaches improve detection of FH in electronic health records, though clinical impact is under-explored. We assessed performance of an array of machine-learning approaches for enhancing detection of FH, and their clinical utility, within a large primary care population. A retrospective cohort study was done using routine primary care clinical records of 4,027,775 individuals from the United Kingdom with total cholesterol measured from 1 January 1999 to 25 June 2019. Predictive accuracy of five common machine-learning algorithms (logistic regression, random forest, gradient boosting machines, neural networks and ensemble learning) were assessed for detecting FH. Predictive accuracy was assessed by area under the receiver operating curves (AUC) and expected vs observed calibration slope; with clinical utility assessed by expected case-review workload and likelihood ratios. There were 7928 incident diagnoses of FH. In addition to known clinical features of FH (raised total cholesterol or LDL-C and family history of premature coronary heart disease), machine-learning (ML) algorithms identified features such as raised triglycerides which reduced the likelihood of FH. Apart from logistic regression (AUC, 0.81), all four other ML approaches had similarly high predictive accuracy (AUC > 0.89). Calibration slope ranged from 0.997 for gradient boosting machines to 1.857 for logistic regression. Among those screened, high probability cases requiring clinical review varied from 0.73% using ensemble learning to 10.16% using deep learning, but with positive predictive values of 15.5% and 2.8% respectively. Ensemble learning exhibited a dominant positive likelihood ratio (45.5) compared to all other ML models (7.0–14.4). Machine-learning models show similar high accuracy in detecting FH, offering opportunities to increase diagnosis. However, the clinical case-finding workload required for yield of cases will differ substantially between models.

Download Full-text

Classification of hazelnut cultivars: comparison of DL4J and ensemble learning algorithms

Notulae Botanicae Horti Agrobotanici Cluj-Napoca ◽

10.15835/nbha48412041 ◽

2020 ◽

Vol 48 (4) ◽

pp. 2316-2327

Author(s):

Caner KOC ◽

Dilara GERDAN ◽

Maksut B. EMİNOĞLU ◽

Uğur YEGÜL ◽

Bulent KOC ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Random Forest ◽

Ensemble Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Performance Criteria ◽

Gradient Boosting ◽

Data Set

Classification of hazelnuts is one of the values adding processes that increase the marketability and profitability of its production. While traditional classification methods are used commonly, machine learning and deep learning can be implemented to enhance the hazelnut classification processes. This paper presents the results of a comparative study of machine learning frameworks to classify hazelnut (Corylus avellana L.) cultivars (‘Sivri’, ‘Kara’, ‘Tombul’) using DL4J and ensemble learning algorithms. For each cultivar, 50 samples were used for evaluations. Maximum length, width, compression strength, and weight of hazelnuts were measured using a caliper and a force transducer. Gradient boosting machine (Boosting), random forest (Bagging), and DL4J feedforward (Deep Learning) algorithms were applied in traditional machine learning algorithms. The data set was partitioned into a 10-fold-cross validation method. The classifier performance criteria of accuracy (%), error percentage (%), F-Measure, Cohen’s Kappa, recall, precision, true positive (TP), false positive (FP), true negative (TN), false negative (FN) values are provided in the results section. The results showed classification accuracies of 94% for Gradient Boosting, 100% for Random Forest, and 94% for DL4J Feedforward algorithms.

Download Full-text

Application of Ensemble Learning Using Weight Voting Protocol in the Prediction of Pile Bearing Capacity

Mathematical Problems in Engineering ◽

10.1155/2021/5558449 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Tuan Anh Pham ◽

Huong-Lan Thi Vu

Keyword(s):

Machine Learning ◽

Bearing Capacity ◽

Ensemble Learning ◽

Learning Algorithm ◽

Load Capacity ◽

Machine Learning Algorithms ◽

Load Test ◽

Gradient Boosting ◽

Foundation Engineering ◽

Expert Performance

Accurate prediction of pile bearing capacity is an important part of foundation engineering. Notably, the determination of pile bearing capacity through an in situ load test is costly and time-consuming. Therefore, this study focused on developing a machine learning algorithm, namely, Ensemble Learning (EL), using weight voting protocol of three base machine learning algorithms, gradient boosting (GB), random forest (RF), and classic linear regression (LR), to predict the bearing capacity of the pile. Data includes 108 pile load tests under different conditions used for model training and testing. Performance evaluation indicators such as R-square (R2), root mean square error (RMSE), and MAE (mean absolute error) were used to evaluate the performance of models showing the efficiency of predicting pile bearing capacity with outstanding performance compared to other models. The results also showed that the EL model with a weight combination of w 1 = 0.482, w 2 = 0.338, and w 3 = 0.18 corresponding to the models GB, RF, and LR gave the best performance and achieved the best balance on all data sets. In addition, the global sensitivity analysis technique was used to detect the most important input features in determining the bearing capacity of the pile. This study provides an effective tool to predict pile load capacity with expert performance.

Download Full-text

Reconstructing the biogeography of a hunter-gatherer planet using machine-learning

10.1101/2021.08.21.457222 ◽

2021 ◽

Author(s):

Marcus J. Hamilton ◽

Robert S. Walker ◽

Briggs Buchanan ◽

Damian E. Blasi ◽

Claire L. Bowern

Keyword(s):

Machine Learning ◽

Human Population ◽

Well Being ◽

Environmental Data ◽

Gradient Boosting ◽

Human Populations ◽

The Past ◽

Large Matrix ◽

Extreme Gradient Boosting ◽

The Tropics

Estimating the total human population size (i.e., abundance) of the preagricultural planet is important for setting the baseline expectations for human-environment interactions if all energy and material requirements to support growth, maintenance, and well-being were foraged from local environments. However, demographic parameters and biogeographic distributions do not preserve directly in the archaeological record. Rather than attempting to estimate human abundance at some specific time in the past, a principled approach to making inferences at this scale is to ask what the human demography and biogeography of a hypothetical planet Earth would look like if populated by ethnographic hunter-gatherer societies. Given ethnographic hunter-gatherer societies likely include the largest, densest, and most complex foraging societies to have existed, we suggest population inferences drawn from this sample provide an upper bound to demographic estimates in prehistory. Our goal in this paper is to produce principled estimates of hunter-gatherer abundance, diversity, and biogeography. To do this we trained an extreme gradient boosting algorithm (XGBoost) to learn ethnographic hunter-gatherer population densities from a large matrix of climatic, environmental, and geographic data. We used the predictions generated by this model to reconstruct the hunter-gatherer biogeography of the rest of the planet. We find the human abundance of this world to be 6.1±2 million with an ethnolinguistic diversity of 8,330±2,770 populations, most of whom would have lived near coasts and in the tropics.Significance StatementUnderstanding the abundance of humans on planet Earth prior to the development of agriculture and the industrialized world is essential to understanding human population growth. However, the problem is that these features of human populations in the past are unknown and so must be estimated from data. We developed a machine learning approach that uses ethnographic and environmental data to reconstruct the demography and biogeography of planet Earth if populated by hunter-gatherers. Such a world would house about 6 million people divided into about 8,330 populations with a particular concentration in the tropics and along coasts.

Download Full-text

Can machine learning bring cardiovascular risk assessment to the next level?

European Heart Journal - Digital Health ◽

10.1093/ehjdh/ztab093 ◽

2021 ◽

Author(s):

Adrien Rousset ◽

David Dellamonica ◽

Romuald Menuet ◽

Armando Lira Pineda ◽

Lea Ricci ◽

...

Keyword(s):

Machine Learning ◽

Risk Assessment ◽

Cardiovascular Risk ◽

Cardiovascular Events ◽

Linear Models ◽

Risk Scores ◽

Gradient Boosting ◽

Major Adverse Cardiovascular Events ◽

Learning Methods ◽

Machine Learning Methods

Abstract Objective Through this proof of concept, we studied the potential added value of machine learning methods in building cardiovascular risk scores from structured data and the conditions under which they outperform linear statistical models. Methods Relying on extensive cardiovascular clinical data from FOURIER, a randomized clinical trial to test for evolocumab efficacy, we compared linear models, neural networks, random forest, and gradient boosting machines for predicting the risk of major adverse cardiovascular events. To study the relative strengths of each method, we extended the comparison to restricted subsets of the full FOURIER dataset, limiting either the number of available patients or the number of their characteristics. Results When using all the 428 covariates available in the dataset, machine learning methods significantly (c-index 0.67, p-value 2e-5) outperformed linear models built from the same variables (c-index 0.62), as well as a reference cardiovascular risk score based on only 10 variables (c-index 0.60). We showed that gradient boosting—the best performing model in our setting—requires fewer patients and significantly outperforms linear models when using large numbers of variables. On the other hand, we illustrate how linear models suffer from being trained on too many variables, thus requiring a more careful prior selection. These machine learning methods proved to consistently improve risk assessment, to be interpretable despite their complexity and to help identify the minimal set of covariates necessary to achieve top performance. Conclusion In the field of secondary cardiovascular events prevention, given the increased availability of extensive electronic health records, machine learning methods could open the door to more powerful tools for patient risk stratification and treatment allocation strategies.

Download Full-text