Guidelines for the validation of machine learning predictions of species interactions

Mapping Intimacies ◽

10.32942/osf.io/aty7n ◽

2022 ◽

Author(s):

Timothée Poisot

Keyword(s):

Predictive Models ◽

Species Interactions ◽

Specific Problem ◽

Class Imbalance ◽

Training Dataset ◽

Training Set ◽

Interaction Prediction ◽

Mathematical Arguments ◽

Data Volume ◽

Binary Classifiers

1. The prediction of species interactions is gaining momentum as a way to circumvent limitations in data volume. Yet, ecological networks are challenging to predict because they are typically small and sparse. Dealing with extreme class imbalance is a challenge for most binary classifiers, and there are currently no guidelines as to how predictive models can be trained for this specific problem.2. Using simple mathematical arguments and numerical experiments in which a variety of classifiers (for supervised learning) are trained on simulated networks, we develop a series of guidelines related to the choice of measures to use for model selection, and the degree of unbiasing to apply to the training dataset.3. Neither classifier accuracy nor the ROC-AUC are informative measures for the performance of interaction prediction. PR-AUC is a fairer assessment of performance. In some cases, even standard measures can lead to selecting a more biased classifier because the effect of connectance is strong. The amount of correction to apply to the training dataset depends on network connectance, on the measure to be optimized, and only weakly on the classifier.4. These results reveal that training machines to predict networks is a challenging task, and that in virtually all cases, the composition of the training set needs to be experimented on before performing the actual training. We discuss these consequences in the context of the low volume of data.

Download Full-text

Evaluation of Power Insulator Detection Efficiency with the Use of Limited Training Dataset

Applied Sciences ◽

10.3390/app10062104 ◽

2020 ◽

Vol 10 (6) ◽

pp. 2104

Author(s):

Michał Tomaszewski ◽

Paweł Michalski ◽

Jakub Osuchowski

Keyword(s):

Neural Network ◽

Neural Networks ◽

Object Detection ◽

Convolutional Neural Network ◽

Deep Neural Networks ◽

Detection Efficiency ◽

Training Data ◽

Training Dataset ◽

Training Set ◽

Convolutional Network

This article presents an analysis of the effectiveness of object detection in digital images with the application of a limited quantity of input. The possibility of using a limited set of learning data was achieved by developing a detailed scenario of the task, which strictly defined the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known architectures of deep neural networks in the process of learning and object detection. The article presents comparisons of results from detecting the most popular deep neural networks while maintaining a limited training set composed of a specific number of selected images from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines. The object detector was built for a power insulator. The main contribution of the presented papier is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. The decision of which network will generate the best result for such a limited training set is not a trivial task. Conducted research suggests that the deep neural networks will achieve different levels of effectiveness depending on the amount of training data. The most beneficial results were obtained for two convolutional neural networks: the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision) at a level of 0.8 for 60 frames. The R-FCN model gained a worse AP result; however, it can be noted that the relationship between the number of input samples and the obtained results has a significantly lower influence than in the case of other CNN models, which, in the authors’ assessment, is a desired feature in the case of a limited training set.

Download Full-text

Development of Predictive Models for “Very Poor” Beach Water Quality Gradings Using Class-Imbalance Learning

Environmental Science & Technology ◽

10.1021/acs.est.1c03350 ◽

2021 ◽

Author(s):

Jiuhao Guo ◽

Joseph H. W. Lee

Keyword(s):

Water Quality ◽

Predictive Models ◽

Class Imbalance ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Beach Water Quality ◽

Beach Water

Download Full-text

Algorithm for Automated Generation of a Training Sample for Solving the Problem of Determining Semantic Similarity between a Pair of Keywords using Machine Learning Methods

PROGRAMMNAYA INGENERIA ◽

10.17587/prin.12.283-294 ◽

2021 ◽

Vol 12 (6) ◽

pp. 283-294

Author(s):

K. V. Lunev ◽

Keyword(s):

Machine Learning ◽

Semantic Similarity ◽

Training Sample ◽

Subject Area ◽

Training Dataset ◽

Training Set ◽

Automated Generation ◽

Machine Learning Methods ◽

Subjective Value ◽

The Subject

Currently, machine learning is an effective approach to solving many problems of information-analytical systems. To use such approaches, a training set of examples is required. Collecting a training dataset is usually a time-consuming process. Its implementation requires the participation of several experts in the subject area for which the training set is collected. Moreover, for some tasks, including the task of determining the semantic similarity of keyword pairs, it is difficult even to correctly draw up instructions for experts to adequately evaluate the test examples. The reason for such difficulties is that semantic similarity is a subjective value and strongly depends on the scope, context, person, and task. The article presents the results of research on the search for models, algorithms and software tools for the automated formation of objects of the training sample in the problem of determining the semantic similarity of a pair of words. In addition, models built on an automated training sample allow us to solve not only the problem of determining semantic similarity, but also an arbitrary problem of classifying edges of a graph. The methods used in this paper are based on graph theory algorithms.

Download Full-text

Machine Learning Readmission Risk Modeling: A Pediatric Case Study

BioMed Research International ◽

10.1155/2019/8532892 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Patricio Wolff ◽

Manuel Graña ◽

Sebastián A. Ríos ◽

Maria Begoña Yarza

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Naive Bayes ◽

Class Imbalance ◽

Predictive Performance ◽

Naïve Bayes ◽

Distribution Model ◽

Training Dataset ◽

Support Vector ◽

Pediatric Hospital

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on the readmission frequency analysis stratified by disease and demographic/geographic characteristics but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions.Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile.Materials. An all-cause admissions dataset has been collected along six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same used for the determination of the child’s treatment administrative cost.Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with decreasing number of folders to assess performance and sensitivity to effect of imbalance in the test set and training set size.Results. Increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, resp.). The NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. High recall of deep multilayer perceptron is due to high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms.Conclusions. We recommend the use of Naive Bayes (NB) with Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.

Download Full-text

Prototyping a Social Media Flooding Photo Screening System Based on Deep Learning

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9020104 ◽

2020 ◽

Vol 9 (2) ◽

pp. 104 ◽

Cited By ~ 7

Author(s):

Huan Ning ◽

Zhenlong Li ◽

Michael E. Hodgson ◽

Cuizhen (Susan) Wang

Keyword(s):

Social Media ◽

Visual Information ◽

Decision Makers ◽

Training Dataset ◽

Screening System ◽

Flood Events ◽

Training Set ◽

Photo Detection ◽

Damage Impact ◽

Total Accuracy

This article aims to implement a prototype screening system to identify flooding-related photos from social media. These photos, associated with their geographic locations, can provide free, timely, and reliable visual information about flood events to the decision-makers. This screening system, designed for application to social media images, includes several key modules: tweet/image downloading, flooding photo detection, and a WebGIS application for human verification. In this study, a training dataset of 4800 flooding photos was built based on an iterative method using a convolutional neural network (CNN) developed and trained to detect flooding photos. The system was designed in a way that the CNN can be re-trained by a larger training dataset when more analyst-verified flooding photos are being added to the training set in an iterative manner. The total accuracy of flooding photo detection was 93% in a balanced test set, and the precision ranges from 46–63% in the highly imbalanced real-time tweets. The system is plug-in enabled, permitting flexible changes to the classification module. Therefore, the system architecture and key components may be utilized in other types of disaster events, such as wildfires, earthquakes for the damage/impact assessment.

Download Full-text

Ortholog-based protein-protein interaction prediction and its application to inter-species interactions

BMC Bioinformatics ◽

10.1186/1471-2105-9-s12-s11 ◽

2008 ◽

Vol 9 (Suppl 12) ◽

pp. S11 ◽

Cited By ~ 54

Author(s):

Sheng-An Lee ◽

Cheng-hsiung Chan ◽

Chi-Hung Tsai ◽

Jin-Mei Lai ◽

Feng-Sheng Wang ◽

...

Keyword(s):

Protein Interaction ◽

Species Interactions ◽

Interaction Prediction ◽

Protein Protein Interaction ◽

Protein Interaction Prediction

Download Full-text

A Data-Analytics Tutorial: Building Predictive Models for Oil Production in an Unconventional Shale Reservoir

SPE Journal ◽

10.2118/189969-pa ◽

2018 ◽

Vol 23 (04) ◽

pp. 1075-1089 ◽

Cited By ~ 14

Author(s):

Jared Schuetter ◽

Srikanta Mishra ◽

Ming Zhong ◽

Randy LaFollette (ret.)

Keyword(s):

Predictive Models ◽

Decision Rules ◽

Regression Tree ◽

Production Performance ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Training Set ◽

Test Set ◽

Well Completion

Summary Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is to provide some clarity to this issue, focusing on how to build robust predictive models and how to develop decision rules that help identify factors separating good wells from poor performers. The data for this study come from wells completed in the Wolfcamp Shale Formation in the Permian Basin. Data categories used in the study included well location and assorted metrics capturing various aspects of well architecture, well completion, stimulation, and production. Predictive models for the production metric of interest are built using simple regression and other advanced methods such as random forests (RFs), support-vector regression (SVR), gradient-boosting machine (GBM), and multidimensional Kriging. The data-fitting process involves splitting the data into a training set and a test set, building a regression model on the training set and validating it with the test set. Repeated application of a “cross-validation” procedure yields valuable information regarding the robustness of each regression-modeling approach. Furthermore, decision rules that can identify extreme behavior in production wells (i.e., top x% of the wells vs. bottom x%, as ranked by the production metric) are generated using the classification and regression-tree algorithm. The resulting decision tree (DT) provides useful insights regarding what variables (or combinations of variables) can drive production performance into such extreme categories. The main contributions of this paper are to provide guidelines on how to build robust predictive models, and to demonstrate the utility of DTs for identifying factors responsible for good vs. poor wells.

Download Full-text

Human Activity Recognition Using Binary Sensors, BLE Beacons, an Intelligent Floor and Acceleration Data: A Machine Learning Approach

Proceedings ◽

10.3390/proceedings2191265 ◽

2018 ◽

Vol 2 (19) ◽

pp. 1265 ◽

Cited By ~ 1

Author(s):

Jesús D. Cerón ◽

Diego M. López ◽

Bjoern M. Eskofier

Keyword(s):

Machine Learning ◽

Activity Recognition ◽

Human Activity ◽

Class Imbalance ◽

Human Activity Recognition ◽

Training Dataset ◽

Desktop Application ◽

Acceleration Sensors ◽

Machine Learning Approach ◽

Mining Projects

Although there have been many studies aimed at the field of Human Activity Recognition, the relationship between what we do and where we do it has been little explored in this field. The objective of this paper is to propose an approach based on machine learning to address the challenge of the 1st UCAmI cup, which is the recognition of 24 activities of daily living using a dataset that allows to explore the aforementioned relationship, since it contains data collected from four data sources: binary sensors, an intelligent floor, proximity and acceleration sensors. The methodology for data mining projects CRISP-DM was followed in this work. To perform synchronization and classification tasks a java desktop application was developed. As a result, the accuracy achieved in the classification of the 24 activities using 10-fold-cross-validation on the training dataset was 92.1%, but an accuracy of 60.1% was obtained on the test dataset. The low accuracy of the classification might be caused by the class imbalance of the training dataset; therefore, more labeled data are necessary for training the algorithm. Although we could not obtain an optimal result, it is possible to iterate in the methodology to look for a way to improve the obtained results.

Download Full-text

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014400105 ◽

2014 ◽

Vol 24 (09) ◽

pp. 1229-1253 ◽

Cited By ~ 3

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Measurement Data ◽

Class Imbalance ◽

Data Preprocessing ◽

High Dimensionality ◽

Training Dataset ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially-problematic program modules. Effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection do help for getting better prediction than the techniques without feature selection.

Download Full-text

Teaching basic numeracy, predictive models and socioeconomics to marine ecologists through Bayesian belief networks

F1000Research ◽

10.12688/f1000research.5981.1 ◽

2014 ◽

Vol 3 ◽

pp. 312 ◽

Cited By ~ 2

Author(s):

Richard Stafford ◽

Rachel Williams

Keyword(s):

Predictive Models ◽

Species Interactions ◽

Rocky Shore ◽

Marine Ecosystem ◽

Predictive Modelling ◽

Bayesian Belief Networks ◽

Fishing Effort ◽

Management Approach ◽

Afternoon Session ◽

Belief Network

Teaching numeric disciplines to higher education students in many life sciences disciplines is highly challenging. In this study, we test whether an approach linking field observations with predictive models can be useful in allowing students to understand basic numeracy and probability, as well as developing skills in modelling, understanding species interactions and even community/ecosystem-service interactions. We presented a field-based lecture in a morning session (on rocky shore ecology), followed by an afternoon session parameterising a belief network using a simple, user-friendly interface. The study was conducted with students during their second week of a foundation degree, hence having little prior knowledge of these systems or models. All students could create realistic predictive models of competition, predation and grazing, although most initially failed to account for trophic cascade effects in parameterising their models of the rocky shore they had previously seen. The belief network was then modified to account for a marine ecosystem management approach, where fishing effort and economic benefit of fishing were linked to population abundance of different species, and management goals were included. Students had little difficultly in applying conceptual links between species and ecosystem services in the same manner as between species. Students evaluated their understanding of a range of variables from rocky shore knowledge to marine management as increasing over the session, but the role of the predictive modelling task was indicated as a major source of learning, even for topics we thought may be better learned in the field. The study adds evidence to the theories that students benefit from exposure to numeric topics, even very early in their degree programmes, but students grasp concepts better when applied to real world situations which they have experience of, or perceive as important.

Download Full-text