Interval Coded Scoring: a toolbox for interpretable scoring systems

PeerJ Computer Science ◽

10.7717/peerj-cs.150 ◽

2018 ◽

Vol 4 ◽

pp. e150 ◽

Cited By ~ 3

Author(s):

Lieven Billiet ◽

Sabine Van Huffel ◽

Vanya Van Belle

Keyword(s):

Machine Learning ◽

Decision Support ◽

Expert Knowledge ◽

Real Life ◽

Scoring Systems ◽

Training Data ◽

Legal Responsibility ◽

Support Vector ◽

Medical Setting ◽

Learning Approaches

Over the last decades, clinical decision support systems have been gaining importance. They help clinicians make effective use of the overload of available information to obtain correct diagnoses and appropriate treatments. However, their power often comes at the cost of a black-box model that cannot be interpreted easily. Interpretability is of paramount importance in a medical setting with regard to trust and (legal) responsibility. In contrast, existing medical scoring systems are easy to understand and use, but they are often a simplified rule-of-thumb summary of previous medical experience rather than a well-founded system based on available data. Interval Coded Scoring (ICS) connects these two approaches, exploiting the power of sparse optimization to derive scoring systems from training data. The presented toolbox interface makes this theory easily applicable to both small and large datasets. It contains two possible problem formulations based on linear programming or elastic net. Both allow the construction of a model for a binary classification problem and establish risk profiles that can be used for future diagnosis. All of this requires only a few lines of code. ICS differs from standard machine learning through its model consisting of interpretable main effects and interactions. Furthermore, insertion of expert knowledge is possible because the training can be semi-automatic. This allows end users to make a trade-off between complexity and performance based on cross-validation results and expert knowledge. Additionally, the toolbox offers an accessible way to assess classification performance via accuracy and the ROC curve, whereas the calibration of the risk profile can be evaluated via a calibration curve. Finally, the colour-coded model visualization has particular appeal if one wants to apply ICS manually to new observations, as well as for validation by experts in the specific application domains.
The validity and applicability of the toolbox are demonstrated by comparing it to standard machine learning approaches such as Naive Bayes and Support Vector Machines on several real-life datasets. These case studies on medical problems show its applicability as a decision support system. ICS performs similarly in terms of classification and calibration. Its slightly lower performance is offset by its model simplicity, which makes it the method of choice if interpretability is a key issue.
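
To make the idea of an interval-coded scoring system concrete, the following is a minimal sketch, not the toolbox's actual interface. The feature names, interval boundaries, point values, and the logistic mapping from score to risk (including its intercept and slope) are all hypothetical values chosen for illustration; in ICS they would be derived from training data.

```python
from bisect import bisect_right
from math import exp

# Hypothetical scoring table (illustrative values only, not from the paper).
# Each feature has sorted interval boundaries and one integer point value
# per interval, so len(points) == len(bounds) + 1.
SCORING_TABLE = {
    "age": {"bounds": [50, 70], "points": [0, 1, 3]},
    "systolic_bp": {"bounds": [120, 160], "points": [0, 1, 2]},
}


def total_score(observation):
    """Sum the points of the interval that each feature value falls into.

    A value equal to a boundary is assigned to the interval above it.
    """
    score = 0
    for feature, spec in SCORING_TABLE.items():
        interval = bisect_right(spec["bounds"], observation[feature])
        score += spec["points"][interval]
    return score


def risk(score, intercept=-3.0, slope=0.9):
    """Map a total score to a probability with a logistic link.

    The link and its coefficients are assumptions made for this sketch.
    """
    return 1.0 / (1.0 + exp(-(intercept + slope * score)))


patient = {"age": 72, "systolic_bp": 130}
score = total_score(patient)
print(score, round(risk(score), 3))
```

Because every feature contributes a small integer read from a table, a model of this shape can be applied by hand, which is the property the abstract emphasizes.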

Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models

Computational Intelligence and Neuroscience ◽

10.1155/2018/1639561 ◽

2018 ◽

Vol 2018 ◽

pp. 1-14 ◽

Cited By ~ 10

Author(s):

Nouar AlDahoul ◽

Aznul Qalid Md Sabri ◽

Ali Mohammed Mansoor

Keyword(s):

Expert Knowledge ◽

Real Life ◽

Feature Learning ◽

Human Detection ◽

Support Vector ◽

Processing Unit ◽

Learning Approaches ◽

Training Time ◽

Central Processing ◽

Average Accuracy

Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on handcrafted features, which are problem-dependent and optimal only for specific tasks. Moreover, they are highly susceptible to dynamic events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need for expert knowledge. In this paper, we utilize automatic feature learning methods which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine (H-ELM)) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The models are compared in terms of training accuracy, testing accuracy, and learning speed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrate that the proposed methods are successful for the human detection task. Pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with soft-max and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM's training takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphics Processing Unit (GPU).

Predicting Physician Consultations for Low Back Pain Using Claims Data and Population-Based Cohort Data—An Interpretable Machine Learning Approach

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph182212013 ◽

2021 ◽

Vol 18 (22) ◽

pp. 12013

Author(s):

Adrian Richter ◽

Julia Truthmann ◽

Jean-François Chenot ◽

Carsten Oliver Schmidt

Keyword(s):

Machine Learning ◽

Low Back Pain ◽

Back Pain ◽

Claims Data ◽

Population Based ◽

Training Data ◽

Support Vector ◽

Low Back ◽

Learning Approaches ◽

Validation Data

(1) Background: Predicting chronic low back pain (LBP) is of clinical and economic interest as LBP leads to disabilities and health service utilization. This study aims to build a competitive and interpretable prediction model; (2) Methods: We used clinical and claims data of 3837 participants of a population-based cohort study to predict future LBP consultations (ICD-10: M40.XX-M54.XX). Best subset selection (BSS) was applied in repeated random samples of training data (75% of data); scoring rules were used to identify the best subset of predictors. The prediction accuracy of BSS was compared to random forest and support vector machines (SVM) in the validation data (25% of data); (3) Results: The best subset comprised 16 out of 32 predictors. Previous occurrence of LBP increased the odds for future LBP consultations (odds ratio (OR) 6.91 [5.05; 9.45]), while concomitant diseases reduced the odds (1 vs. 0, OR: 0.74 [0.57; 0.98], >1 vs. 0: 0.37 [0.21; 0.67]). The area-under-curve (AUC) of BSS was acceptable (0.78 [0.74; 0.82]) and comparable with SVM (0.78 [0.74; 0.82]) and random forest (0.79 [0.75; 0.83]); (4) Conclusions: Regarding prediction accuracy, BSS has been considered competitive with established machine-learning approaches. Nonetheless, considerable misclassification is inherent and further refinements are required to improve predictions.
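
The AUC values used above to compare models have a simple interpretation: the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name, labels, and scores are made up for illustration and are not the study's data or code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = consultation occurred, 0 = no consultation.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; rank-based implementations are preferable for large validation sets.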

Development of a machine-learning-based decision support mechanism for predicting chemical tanker cleaning activity

Journal of Modelling in Management ◽

10.1108/jm2-12-2019-0284 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Burak Cankaya ◽

Berna Eren Tokgoz ◽

Ali Dag ◽

K.C. Santosh

Keyword(s):

Machine Learning ◽

Decision Support ◽

Test Data ◽

Machine Learning Algorithms ◽

Training Data ◽

Comparative Approach ◽

Support Vector ◽

Data Set ◽

Content Type ◽

Vehicle Activity

Purpose: This paper aims to propose a machine learning-based automatic labeling methodology for chemical tanker activities that can be applied to any port with any number of active tankers and the identification of important predictors. The methodology can be applied to any type of activity tracking that is based on automatically generated geospatial data. Design/methodology/approach: The proposed methodology uses three machine learning algorithms (artificial neural networks, support vector machines (SVMs) and random forest) along with information fusion (IF)-based sensitivity analysis to classify chemical tanker activities. The data set is split into training and test data based on vessels, with two vessels in the training data and one in the test data set. Important predictors were identified using a receiver operating characteristic comparative approach, and overall variable importance was calculated using IF from the top models. Findings: Results show that an SVM model has the best balance between sensitivity and specificity, at 93.5% and 91.4%, respectively. Speed, acceleration and change in the course on the ground for the vessels are identified as the most important predictors for classifying vessel activity. Research limitations/implications: The study evaluates the vessel movements waiting between different terminals in the same port, but not their movements between different ports for their tank-cleaning activities. Practical implications: The findings in this study can be used by port authorities, shipping companies, vessel operators and other stakeholders for decision support, performance tracking, as well as for automated alerts.
Originality/value: This analysis makes original contributions to the existing literature by defining and demonstrating a methodology that can automatically label vehicle activity based on location data and identify certain characteristics of the activity by finding important location-based predictors that effectively classify the activity status.
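
Splitting by vessel, rather than by individual record, keeps all observations of one vessel on the same side of the split, so the test set measures generalization to an unseen vessel. A minimal sketch of such a group-wise split follows; the record fields, vessel identifiers, and function name are hypothetical and not taken from the paper.

```python
def split_by_group(records, group_key, test_groups):
    """Split records so that whole groups land in either train or test."""
    test_groups = set(test_groups)
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Illustrative records: three vessels with a few position reports each.
records = [
    {"vessel": "A", "speed": 0.2, "label": "cleaning"},
    {"vessel": "A", "speed": 7.5, "label": "other"},
    {"vessel": "B", "speed": 0.1, "label": "cleaning"},
    {"vessel": "B", "speed": 6.8, "label": "other"},
    {"vessel": "C", "speed": 0.3, "label": "cleaning"},
    {"vessel": "C", "speed": 8.1, "label": "other"},
]

# Two vessels for training, one held out for testing.
train, test = split_by_group(records, "vessel", test_groups=["C"])
print(len(train), len(test))
```

A random record-level split would instead place reports from the same vessel in both sets, which tends to give optimistic test results.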

Application of Machine Learning Approaches for the Design and Study of Anticancer Drugs

Current Drug Targets ◽

10.2174/1389450119666180809122244 ◽

2019 ◽

Vol 20 (5) ◽

pp. 488-500 ◽

Cited By ~ 6

Author(s):

Yan Hu ◽

Yi Lu ◽

Shuo Wang ◽

Mengying Zhang ◽

Xiaosheng Qu ◽

...

Keyword(s):

Machine Learning ◽

Drug Design ◽

Anticancer Drugs ◽

Nearest Neighbor ◽

Cost Effective ◽

Support Vector ◽

Learning Approaches ◽

K Nearest Neighbor ◽

Activity Prediction ◽

Linear Discriminant

Background: Globally, the numbers of cancer patients and cancer deaths continue to increase yearly, and cancer has therefore become one of the world's leading causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics. Objective: In this review, in order to study the application of machine learning in predicting anticancer drug activity, some machine learning approaches such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB) were selected, and examples of their applications in anticancer drug design are listed. Results: Machine learning contributes substantially to anticancer drug design and helps researchers by saving time and cost. However, it can only be an assisting tool for drug design. Conclusion: This paper introduces the application of machine learning approaches in anticancer drug design. Many examples of success in identification and prediction in the area of anticancer drug activity prediction are discussed, and anticancer drug research is still in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.

Multilayer Soil Moisture Mapping at a Regional Scale from Multisource Data via a Machine Learning Method

Remote Sensing ◽

10.3390/rs11030284 ◽

2019 ◽

Vol 11 (3) ◽

pp. 284 ◽

Cited By ~ 1

Author(s):

Linglin Zeng ◽

Shun Hu ◽

Daxiang Xiang ◽

Xiang Zhang ◽

Deren Li ◽

...

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Regional Scale ◽

Remotely Sensed ◽

Temporal Variations ◽

Training Data ◽

Estimation Accuracy ◽

Learning Approaches ◽

Remotely Sensed Data ◽

Deep Soil

Soil moisture mapping at a regional scale is commonplace since these data are required in many applications, such as hydrological and agricultural analyses. The use of remotely sensed data for the estimation of deep soil moisture at a regional scale has received far less emphasis. The objective of this study was to map the 500-m, 8-day average and daily soil moisture at different soil depths in Oklahoma from remotely sensed and ground-measured data using the random forest (RF) method, which is one of the machine-learning approaches. In order to investigate the estimation accuracy of the RF method at both a spatial and a temporal scale, two independent soil moisture estimation experiments were conducted using data from 2010 to 2014: a year-to-year experiment (with a root mean square error (RMSE) ranging from 0.038 to 0.050 m³/m³) and a station-to-station experiment (with an RMSE ranging from 0.044 to 0.057 m³/m³). Then, the data requirements, importance factors, and spatial and temporal variations in estimation accuracy were discussed based on the results using the training data selected by iterated random sampling. The highly accurate estimations of both the surface and the deep soil moisture for the study area reveal the potential of RF methods when mapping soil moisture at a regional scale, especially when considering the high heterogeneity of land-cover types and topography in the study area.
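
For reference, the RMSE reported above is the square root of the mean squared difference between observed and estimated values, so it carries the same unit as the soil moisture itself (volumetric water content). The sketch below shows the computation on made-up values that are not from the study.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


# Illustrative volumetric soil moisture values (hypothetical).
observed = [0.20, 0.25, 0.30]
predicted = [0.22, 0.21, 0.33]
print(round(rmse(observed, predicted), 4))
```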

Machine Learning Methods Applied to the Prediction of Pseudo-nitzschia spp. Blooms in the Galician Rias Baixas (NW Spain)

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10040199 ◽

2021 ◽

Vol 10 (4) ◽

pp. 199

Author(s):

Francisco M. Bellas Aláez ◽

Jesus M. Torres Palenzuela ◽

Evangelos Spyrakos ◽

Luis González Vilas

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Prediction Models ◽

Support Vector ◽

False Alarms ◽

Learning Approaches ◽

Learning Methods ◽

Machine Learning Methods ◽

Rías Baixas ◽

New Algorithms

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.
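
The trade-off described above, high sensitivity paired with many false alarms, is easiest to see from the confusion-matrix definitions of the metrics. The sketch below uses invented counts for an unbalanced dataset in which bloom events are rare; the function name and numbers are illustrative only.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and precision from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true events detected
        "specificity": tn / (tn + fp),  # share of non-events correctly rejected
        "precision": tp / (tp + fp),    # share of alarms that were real events
    }


# Hypothetical counts: 20 events and 170 non-events.
metrics = confusion_metrics(tp=18, fp=12, tn=158, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With rare events, even a small false positive rate produces many false alarms relative to true detections, which is why precision can be low while sensitivity and specificity both look high.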

NLOS Multipath Classification of GNSS Signal Correlation Output Using Machine Learning

Sensors ◽

10.3390/s21072503 ◽

2021 ◽

Vol 21 (7) ◽

pp. 2503

Author(s):

Taro Suzuki ◽

Yoshiharu Amano

Keyword(s):

Machine Learning ◽

Satellite System ◽

Training Data ◽

Support Vector ◽

Positioning Errors ◽

Automated Method ◽

Global Navigation Satellite ◽

Better Than ◽

Signal Correlation

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath. Features of the shape of the multi-correlator outputs are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS signal training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that NN was better than SVM, and 97.7% of NLOS signals were correctly discriminated.

Practical CO2—WAG Field Operational Designs Using Hybrid Numerical-Machine-Learning Approaches

Energies ◽

10.3390/en14041055 ◽

2021 ◽

Vol 14 (4) ◽

pp. 1055

Author(s):

Qian Sun ◽

William Ampomah ◽

Junyu You ◽

Martha Cather ◽

Robert Balch

Keyword(s):

Machine Learning ◽

Oil Recovery ◽

History Matching ◽

Optimization Problems ◽

Learning Technologies ◽

Petroleum Engineering ◽

Support Vector ◽

Learning Approaches ◽

Field Development ◽

Proxy Models

Machine-learning technologies have proven capable of solving many petroleum engineering problems. Their accurate predictions and fast computation make a large volume of time-consuming engineering processes, such as history matching and field development optimization, feasible. The Southwest Regional Partnership on Carbon Sequestration (SWP) project requires rigorous history-matching and multi-objective optimization processes, which play to the strengths of machine-learning approaches. Although the machine-learning proxy models are trained and validated before being applied to practical problems, their error margin inevitably introduces uncertainty into the results. In this paper, a hybrid numerical machine-learning workflow for solving various optimization problems is presented. By coupling the expert machine-learning proxies with a global optimizer, the workflow successfully solves the history-matching and CO2 water-alternating-gas (WAG) design problems with low computational overhead. The history-matching work considers the heterogeneities of multiphase relative characteristics, and the CO2-WAG injection design takes multiple techno-economic objective functions into account. This work trained an expert response surface, a support vector machine, and a multi-layer neural network as proxy models to effectively learn the high-dimensional nonlinear data structure. The proposed workflow suggests revisiting the high-fidelity numerical simulator for validation purposes. The experience gained from this work should provide valuable guidance for similar CO2 enhanced oil recovery (EOR) projects.

Detection of Malicious Software by Analyzing Distinct Artifacts Using Machine Learning and Deep Learning Algorithms

Electronics ◽

10.3390/electronics10141694 ◽

2021 ◽

Vol 10 (14) ◽

pp. 1694

Author(s):

Mathew Ashik ◽

A. Jyothish ◽

S. Anandaram ◽

P. Vinod ◽

Francesco Mercaldo ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Support Vector ◽

Malware Analysis ◽

Learning Approaches ◽

Dynamic Features ◽

System Calls ◽

Prevention Methods ◽

Structural Aspects

Malware is one of the most significant threats in today’s computing world since the number of websites distributing malware is increasing at a rapid rate. Malware analysis and prevention methods are increasingly becoming necessary for computer systems connected to the Internet. Malware exploits a system’s vulnerabilities to steal valuable information without the user’s knowledge and stealthily sends it to remote servers controlled by attackers. Traditionally, anti-malware products use signatures for detecting known malware. However, the signature-based method does not scale to detecting obfuscated and packed malware. Because the cause of a problem is often best understood by studying structural aspects of a program such as mnemonics, instruction opcodes, and API calls, in this paper we investigate the relevance of these features in unpacked malicious and benign executables to identify a feature that classifies the executable. Prominent features are extracted using Minimum Redundancy and Maximum Relevance (mRMR) and Analysis of Variance (ANOVA). Experiments were conducted on four datasets using machine learning and deep learning approaches such as Support Vector Machine (SVM), Naïve Bayes, J48, Random Forest (RF), and XGBoost. In addition, we evaluate the performance of a collection of deep neural networks, namely a Deep Dense network, a One-Dimensional Convolutional Neural Network (1D-CNN), and a CNN-LSTM, in classifying unknown samples, and we observed promising results using APIs and system calls. On combining APIs/system calls with static features, a marginal performance improvement was attained compared with models trained only on dynamic features. Moreover, to improve accuracy, we implemented our solution using distinct deep learning methods and demonstrated a fine-tuned deep neural network that resulted in an F1-score of 99.1% and 98.48% on Dataset-2 and Dataset-3, respectively.

Analysis of the Nosema Cells Identification for Microscopic Images

Sensors ◽

10.3390/s21093068 ◽

2021 ◽

Vol 21 (9) ◽

pp. 3068

Author(s):

Soumaya Dghim ◽

Carlos M. Travieso-González ◽

Radim Burget

Keyword(s):

Neural Network ◽

Machine Learning ◽

Image Processing ◽

Deep Learning ◽

The Other ◽

Support Vector ◽

Learning Approaches ◽

Microscopic Images ◽

Trained Neural Network ◽

Nosema Disease

The use of image processing tools, machine learning, and deep learning approaches has become very useful and robust in recent years. This paper introduces the detection of the Nosema disease, which is considered to be one of the most economically significant diseases today. This work presents a solution for recognizing and identifying Nosema cells among the other objects in a microscopic image. Two main strategies are examined. The first strategy uses image processing tools to extract the most valuable information and features from the dataset of microscopic images. Then, machine learning methods, such as an artificial neural network (ANN) and a support vector machine (SVM), are applied to detect and classify the Nosema disease cells. The second strategy explores deep learning and transfer learning. Several approaches were examined, including a convolutional neural network (CNN) classifier and several methods of transfer learning (AlexNet, VGG-16 and VGG-19), which were fine-tuned and applied to the object sub-images in order to distinguish the Nosema images from the other object images. The best accuracy, 96.25%, was reached by the pre-trained VGG-16 neural network.
