The Application of Machine Learning to a General Risk–Need Assessment Instrument in the Prediction of Criminal Recidivism

The Level of Service/Case Management Inventory (LS/CMI) is one of the most frequently used tools to assess criminogenic risk–need in justice-involved individuals. Meta-analytic research demonstrates strong predictive accuracy for various recidivism outcomes. In this exploratory study, we applied machine learning (ML) algorithms (decision trees, random forests, and support vector machines) to a data set with nearly 100,000 LS/CMI administrations to provincial corrections clientele in Ontario, Canada, and approximately 3 years follow-up. The overall accuracies and areas under the receiver operating characteristic curve (AUCs) were comparable, although ML outperformed LS/CMI in terms of predictive accuracy for the middle scores where it is hardest to predict the recidivism outcome. Moreover, ML improved the AUCs for individual scores to near 0.60, from 0.50 for the LS/CMI, indicating that ML also improves the ability to rank individuals according to their probability of recidivating. Potential considerations, applications, and future directions are discussed.

A Review of Machine Learning Techniques for Anomaly Detection in Static Graphs

Implementing Computational Intelligence Techniques for Security Systems Design - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-7998-2418-3.ch007 ◽

2020 ◽

pp. 146-162

Author(s):

Hesham M. Al-Ammal

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Anomaly Detection ◽

Real Life ◽

Machine Learning Techniques ◽

Support Vector ◽

Learning Methods ◽

Data Set ◽

Learning Techniques ◽

Vector Machines

Detection of anomalies in a given data set is a vital step in several cybersecurity applications, including intrusion detection, fraud detection, and social network analysis. Many of these techniques detect anomalies by examining graph-based data. Analyzing graphs makes it possible to capture relationships and communities as well as anomalies. The advantage of using graphs is that many real-life situations can be easily modeled by a graph that captures their structure and interdependencies. Although anomaly detection in graphs dates back to the 1990s, recent research has applied machine learning methods to anomaly detection over graphs. This chapter concentrates on static graphs (both labeled and unlabeled) and summarizes some of these recent studies in machine learning for anomaly detection in graphs. The methods covered include support vector machines, neural networks, generative neural networks, and deep learning methods. The chapter reflects on the successes and challenges of using these methods in the context of graph-based anomaly detection.

Learning to Identify At-Risk Students in Distance Education Using Interaction Counts

Revista de Informática Teórica e Aplicada ◽

10.22456/2175-2745.62211 ◽

2016 ◽

Vol 23 (2) ◽

pp. 124 ◽

Cited By ~ 2

Author(s):

Douglas Detoni ◽

Cristian Cechinel ◽

Ricardo Araujo Matsumura ◽

Daniela Francisco Brauner

Keyword(s):

Machine Learning ◽

At Risk ◽

At Risk Students ◽

Drop Out ◽

Support Vector ◽

Learning Models ◽

Data Set ◽

Student Dropout ◽

Vector Machines ◽

Machine Learning Models

Student dropout is one of the main problems faced by distance learning courses. One of the major challenges for researchers is to develop methods to predict the behavior of students so that teachers and tutors are able to identify at-risk students as early as possible and provide assistance before they drop out or fail their courses. Machine learning models have been used to predict or classify students in these settings. However, while these models have shown promising results in several settings, they usually attain these results using attributes that are not immediately transferable to other courses or platforms. In this paper, we provide a methodology to classify students using only interaction counts from each student. We evaluate this methodology on a data set from two majors based on the Moodle platform. We run experiments consisting of training and evaluating three machine learning models (Support Vector Machines, Naive Bayes, and AdaBoost decision trees) under different scenarios. We provide evidence that patterns from interaction counts can provide useful information for classifying at-risk students. This classification allows the customization of the activities presented to at-risk students (automatically or through tutors) in an attempt to prevent student dropout.
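
As a rough illustration of what "interaction counts" could look like as classifier input, the sketch below turns an event log into one count vector per student. The action names, the log layout, and the function name are hypothetical and are not taken from the paper; the abstract does not say how the counts were grouped.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical action types; a real platform log would define its own.
ACTIONS: List[str] = ["forum_post", "resource_view", "quiz_attempt"]


def interaction_counts(log: Iterable[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Turn (student_id, action) events into one count vector per student."""
    per_student: Dict[str, Counter] = {}
    for student_id, action in log:
        per_student.setdefault(student_id, Counter())[action] += 1
    # Fixed column order so every student gets a vector of the same shape.
    return {sid: [counts[a] for a in ACTIONS] for sid, counts in per_student.items()}


# Hypothetical log entries, for illustration only.
log = [
    ("s1", "forum_post"),
    ("s1", "resource_view"),
    ("s1", "resource_view"),
    ("s2", "quiz_attempt"),
]
features = interaction_counts(log)
```

Vectors of this kind contain no course-specific attributes, which is the transferability property the abstract emphasizes; they could then be passed to any of the three classifier families it names.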

Multi-Hazard Exposure Mapping Using Machine Learning Techniques: A Case Study from Iran

Remote Sensing ◽

10.3390/rs11161943 ◽

2019 ◽

Vol 11 (16) ◽

pp. 1943 ◽

Cited By ~ 15

Author(s):

Omid Rahmati ◽

Saleh Yousefi ◽

Zahra Kalantari ◽

Evelyn Uuemaa ◽

Teimur Teimurian ◽

...

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Characteristic Curve ◽

Machine Learning Techniques ◽

Support Vector ◽

Mountainous Area ◽

Data Set ◽

Boosted Regression Tree ◽

Hazard Exposure ◽

Learning Techniques

Mountainous areas are highly prone to a variety of nature-triggered disasters, which often cause disabling harm, death, destruction, and damage. In this work, an attempt was made to develop an accurate multi-hazard exposure map for a mountainous area (Asara watershed, Iran), based on state-of-the-art machine learning techniques. Hazard modeling for avalanches, rockfalls, and floods was performed using three state-of-the-art models: support vector machine (SVM), boosted regression tree (BRT), and generalized additive model (GAM). Topo-hydrological and geo-environmental factors were used as predictors in the models. A flood data set (n = 133 flood events) was applied, which had been prepared using Sentinel-1-based processing and ground-based information. In addition, snow avalanche (n = 58) and rockfall (n = 101) data sets were used. The data set of each hazard type was randomly divided into two groups: training (70%) and validation (30%). Model performance was evaluated by the true skill score (TSS) and the area under the receiver operating characteristic curve (AUC) criteria. Using an exposure map, the multi-hazard map was converted into a multi-hazard exposure map. According to both validation methods, the SVM model showed the highest accuracy for avalanches (AUC = 92.4%, TSS = 0.72) and rockfalls (AUC = 93.7%, TSS = 0.81), while BRT demonstrated the best performance for flood hazards (AUC = 94.2%, TSS = 0.80). Overall, multi-hazard exposure modeling revealed that valleys and areas close to the Chalous Road, one of the most important roads in Iran, were associated with high and very high levels of risk. The proposed multi-hazard exposure framework can be helpful in supporting decision making on mountain social-ecological systems facing multiple hazards.
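
The two validation criteria named above can be sketched in a few lines. This is a minimal illustration using the usual definitions (TSS as sensitivity plus specificity minus one, AUC as the probability that a positive case outscores a negative one); the function names and the toy labels and scores are assumptions, not data from the study.

```python
from typing import Sequence


def true_skill_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """TSS = sensitivity + specificity - 1, for binary labels (1 = hazard)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positives and negatives; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy validation set, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
tss_value = true_skill_score(y_true, y_pred)
auc_value = auc(y_true, scores)
```

TSS depends on a chosen classification threshold, whereas AUC summarizes ranking quality over all thresholds, which is one reason the two criteria are often reported together.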

Assessing Replicability of Machine Learning Results: An Introduction to Methods on Predictive Accuracy in Social Sciences

Social Science Computer Review ◽

10.1177/0894439319888445 ◽

2019 ◽

pp. 089443931988844

Author(s):

Ranjith Vijayakumar ◽

Mike W.-L. Cheung

Keyword(s):

Machine Learning ◽

Empirical Data ◽

Fixed Effects ◽

Predictive Accuracy ◽

Support Vector ◽

Learning Methods ◽

Data Set ◽

Replication Studies ◽

Machine Learning Methods ◽

Accuracy Measure

Machine learning methods have become very popular in diverse fields due to their focus on predictive accuracy, but little work has been conducted on how to assess the replicability of their findings. We introduce and adapt replication methods advocated in psychology to the aims and procedural needs of machine learning research. In Study 1, we illustrate these methods with the use of an empirical data set, assessing the replication success of a predictive accuracy measure, namely, R² on the cross-validated and test sets of the samples. We introduce three replication aims. First, tests of inconsistency examine whether single replications have successfully rejected the original study. Rejection will be supported if the 95% confidence interval (CI) of R² difference estimates between replication and original does not contain zero. Second, tests of consistency help support claims of successful replication. We can decide a priori on a region of equivalence, where population values of the difference estimates are considered equivalent for substantive reasons. The 90% CI of a difference estimate lying fully within this region supports replication. Third, we show how to combine replications to construct meta-analytic intervals for better precision of predictive accuracy measures. In Study 2, R² is reduced from the original in a subset of replication studies to examine the ability of the replication procedures to distinguish true replications from nonreplications. We find that when combining studies sampled from the same population to form meta-analytic intervals, random-effects methods perform best for cross-validated measures while fixed-effects methods work best for test measures. Among machine learning methods, regression was comparable to many complex methods, while support vector machine performed most reliably across a variety of scenarios. Social scientists who use machine learning to model empirical data can use these methods to enhance the reliability of their findings.
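
The first two replication aims reduce to simple interval checks, sketched below. The decision rules follow the abstract; the normal-approximation interval, the function names, and the example numbers are assumptions added for illustration, since the abstract does not state how the intervals were computed.

```python
from statistics import NormalDist
from typing import Tuple

Interval = Tuple[float, float]


def normal_ci(diff: float, se: float, level: float) -> Interval:
    """Normal-approximation CI for a difference estimate (an assumption here)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (diff - z * se, diff + z * se)


def rejects_original(ci95: Interval) -> bool:
    """Test of inconsistency: the 95% CI of the difference excludes zero."""
    lower, upper = ci95
    return not (lower <= 0.0 <= upper)


def supports_replication(ci90: Interval, region: Interval) -> bool:
    """Test of consistency: the 90% CI lies fully inside the equivalence region."""
    return region[0] <= ci90[0] and ci90[1] <= region[1]


# Hypothetical difference in R-squared (replication minus original) and its SE.
diff, se = -0.02, 0.01
ci95 = normal_ci(diff, se, 0.95)
ci90 = normal_ci(diff, se, 0.90)
inconsistent = rejects_original(ci95)
consistent = supports_replication(ci90, (-0.05, 0.05))
```

With these hypothetical numbers both checks come out true: the difference is statistically distinguishable from zero yet still inside a region of plus or minus 0.05. The two tests answer different questions, so the choice of equivalence region has to be justified on substantive grounds before the data are seen.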

Prediction of Heart Disease using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1081.0982s1019 ◽

2019 ◽

Vol 8 (2S10) ◽

pp. 474-477

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Support Vector Machines ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Data Set ◽

Vector Machines ◽

Naive Bayes Classification ◽

Naïve Bayes Classification

Machine learning is one of the fastest-growing fields today. Machine learning (ML) and artificial neural networks (ANNs) are helpful in the detection and diagnosis of various heart diseases. Naïve Bayes classification is an important classification approach in machine learning. Heart disease comprises a range of disorders affecting the heart. It includes blood vessel problems, irregular heartbeat, weak heart muscles, congenital heart defects, cardiovascular disease, and coronary artery disease. Coronary heart disease is a common type of heart disease. It reduces the blood flow to the heart, which can lead to a heart attack. In this paper, the UCI Machine Learning Repository data set of patients suffering from heart disease is analyzed using Naïve Bayes classification and support vector machines. The classification accuracy for patients suffering from heart disease is assessed using Naïve Bayes classification and support vector machines. The implementation uses the R language.

Validation Study of the Accuracy of a Postoperative Nomogram for Recurrence After Radical Prostatectomy for Localized Prostate Cancer

Journal of Clinical Oncology ◽

10.1200/jco.2002.20.4.951 ◽

2002 ◽

Vol 20 (4) ◽

pp. 951-956 ◽

Cited By ~ 37

Author(s):

Markus Graefen ◽

Pierre I. Karakiewicz ◽

Ilias Cagiannos ◽

Eric Klein ◽

Patrick A. Kupelian ◽

...

Keyword(s):

Prostate Cancer ◽

Radical Prostatectomy ◽

Operating Characteristic ◽

Predictive Accuracy ◽

Characteristic Curve ◽

Extracapsular Extension ◽

Validation Data ◽

Data Set ◽

Time Period

PURPOSE: A postoperative nomogram for prostate cancer was developed at Baylor College of Medicine. This nomogram uses readily available clinical and pathologic variables to predict 7-year freedom from recurrence after radical prostatectomy. We evaluated the predictive accuracy of the nomogram when applied to patients of four international institutions. PATIENTS AND METHODS: Clinical and pathologic data of 2,908 patients were supplied for validation, and 2,465 complete records were used. Nomogram-predicted probabilities of 7-year freedom from recurrence were compared with actual follow-up in two ways. First, the area under the receiver operating characteristic curve (AUC) was calculated for all patients and stratified by the time period of surgery. Second, calibration of the nomogram was achieved by comparing the predicted freedom from recurrence with that of an ideal nomogram. For patients in whom the pathologic report does not distinguish between focal and established extracapsular extension (an input variable of the nomogram), two separate calculations were performed assuming one or the other. RESULTS: The overall AUC was 0.80 when applied to the validation data set, with individual institution AUCs ranging from 0.77 to 0.82. The predictive accuracy of the nomogram was apparently higher in patients who were operated on between 1997 and 2000 (AUC, 0.83) compared with those treated between 1987 and 1996 (AUC, 0.78). Nomogram predictions of 7-year freedom from recurrence were within 10% of an ideal nomogram. CONCLUSION: The postoperative Baylor nomogram was accurate when applied at international treatment institutions. Our results suggest that accurate predictions may be expected when using this nomogram across different patient populations.

Iterative Reweighted Noninteger Norm Regularizing SVM for Gene Expression Data Classification

Computational and Mathematical Methods in Medicine ◽

10.1155/2013/768404 ◽

2013 ◽

Vol 2013 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Jianwei Liu ◽

Shuang Cheng Li ◽

Xionglin Luo

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Adaptive Learning ◽

Predictive Accuracy ◽

Learning Algorithm ◽

Training Dataset ◽

Support Vector ◽

Data Set ◽

Cancer Data ◽

Public Data

Support vector machine is an effective classification and regression method that uses machine learning theory to maximize the predictive accuracy while avoiding overfitting of data. L2 regularization has been commonly used. If the training dataset contains many noise variables, L1 regularization SVM will provide a better performance. However, neither L1 nor L2 is the optimal regularization method when handling a large number of redundant values where only a small number of data points are useful for machine learning. We have therefore proposed an adaptive learning algorithm using the iterative reweighted p-norm regularization support vector machine for 0 < p ≤ 2. A simulated data set was created to evaluate the algorithm. It was shown that a p value of 0.8 was able to produce a better feature selection rate with high accuracy. Four cancer data sets from public data banks were also used for the evaluation. All four evaluations show that the new adaptive algorithm was able to achieve the optimal prediction error using a p value less than the L1 norm. Moreover, we observe that the proposed Lp penalty is more robust to noise variables than the L1 and L2 penalties.
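
To make the role of p concrete, the sketch below evaluates a penalty of the form sum of |w_j|^p on a sparse and a dense weight vector, and computes the quadratic-surrogate weights used by generic iteratively reweighted schemes. This is an assumption-laden illustration: the penalty form, the reweighting formula, the smoothing constant eps, and the example vectors are standard textbook choices, not details confirmed by the abstract.

```python
from typing import List, Sequence


def lp_penalty(w: Sequence[float], p: float) -> float:
    """Penalty sum(|w_j|^p); p = 2 gives the squared L2 norm, p = 1 the L1 norm."""
    return sum(abs(wj) ** p for wj in w)


def reweighting(w_old: Sequence[float], p: float, eps: float = 1e-8) -> List[float]:
    """Weights c_j such that c_j * w_j**2 matches |w_j|^p at w_j = w_old_j.

    Generic iteratively reweighted surrogate (an assumption, not the paper's
    exact update); eps keeps the weight finite when a coefficient is zero.
    """
    return [(wj * wj + eps) ** (p / 2.0 - 1.0) for wj in w_old]


# Two hypothetical weight vectors with the same squared L2 norm.
w_sparse = [1.0, 0.0, 0.0, 0.0]
w_dense = [0.5, 0.5, 0.5, 0.5]

for p in (2.0, 1.0, 0.8):
    print(p, lp_penalty(w_sparse, p), lp_penalty(w_dense, p))
```

At p = 2 the two vectors are penalized equally, while at p = 1 and p = 0.8 the dense vector costs progressively more. That growing preference for sparse solutions is the intuition behind choosing p below 1 when many variables are noise.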

Social Reminiscence in Older Adults’ Everyday Conversations: Automated Detection Using Natural Language Processing and Machine Learning (Preprint)

10.2196/preprints.19133 ◽

2020 ◽

Author(s):

Andrea Ferrario ◽

Burcu Demiray ◽

Kristina Yordanova ◽

Minxia Luo ◽

Mike Martin

Keyword(s):

Machine Learning ◽

Older Adults ◽

Support Vector Machines ◽

Learning Strategies ◽

Support Vector ◽

Bag Of Words ◽

Word Embeddings ◽

Data Set ◽

Extreme Gradient Boosting ◽

Vector Machines

BACKGROUND Reminiscence is the act of thinking or talking about personal experiences that occurred in the past. It is a central task of old age that is essential for healthy aging, and it serves multiple functions, such as decision-making and introspection, transmitting life lessons, and bonding with others. The study of social reminiscence behavior in everyday life can be used to generate data and detect reminiscence from general conversations. OBJECTIVE The aims of this original paper are to (1) preprocess coded transcripts of conversations in German of older adults with natural language processing (NLP), and (2) implement and evaluate learning strategies using different NLP features and machine learning algorithms to detect reminiscence in a corpus of transcripts. METHODS The methods in this study comprise (1) collecting and coding of transcripts of older adults’ conversations in German, (2) preprocessing transcripts to generate NLP features (bag-of-words models, part-of-speech tags, pretrained German word embeddings), and (3) training machine learning models to detect reminiscence using random forests, support vector machines, and adaptive and extreme gradient boosting algorithms. The data set comprises 2214 transcripts, including 109 transcripts with reminiscence. Due to class imbalance in the data, we introduced three learning strategies: (1) class-weighted learning, (2) a meta-classifier consisting of a voting ensemble, and (3) data augmentation with the Synthetic Minority Oversampling Technique (SMOTE) algorithm. For each learning strategy, we performed cross-validation on a random sample of the training data set of transcripts. We computed the area under the curve (AUC), the average precision (AP), precision, recall, as well as F1 score and specificity measures on the test data, for all combinations of NLP features, algorithms, and learning strategies. 
RESULTS Class-weighted support vector machines on bag-of-words features outperformed all other classifiers (AUC=0.91, AP=0.56, precision=0.5, recall=0.45, F1=0.48, specificity=0.98), followed by support vector machines on SMOTE-augmented data and word embeddings features (AUC=0.89, AP=0.54, precision=0.35, recall=0.59, F1=0.44, specificity=0.94). For the meta-classifier strategy, adaptive and extreme gradient boosting algorithms trained on word embeddings and bag-of-words outperformed all other classifiers and NLP features; however, the performance of the meta-classifier learning strategy was lower compared to other strategies, with highly imbalanced precision-recall trade-offs. CONCLUSIONS This study provides evidence of the applicability of NLP and machine learning pipelines for the automated detection of reminiscence in older adults’ everyday conversations in German. The methods and findings of this study could be relevant for designing unobtrusive computer systems for the real-time detection of social reminiscence in the everyday life of older adults and classifying their functions. With further improvements, these systems could be deployed in health interventions aimed at improving older adults’ well-being by promoting self-reflection and suggesting coping strategies to be used in the case of dysfunctional reminiscence cases, which can undermine physical and mental health.
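
The class imbalance described above (109 reminiscence transcripts out of 2214) can be turned into class weights with the common inverse-frequency heuristic, and the F1 score follows directly from precision and recall. The heuristic and the function names below are assumptions for illustration; the abstract does not state which weighting formula was used.

```python
from typing import Dict


def balanced_class_weights(counts: Dict[int, int]) -> Dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Class counts from the abstract: 109 transcripts with reminiscence out of 2214.
weights = balanced_class_weights({1: 109, 0: 2214 - 109})

# F1 recomputed from the rounded precision and recall reported for the best model.
f1_from_rounded = f1_score(0.5, 0.45)
```

Under this heuristic a reminiscence transcript would weigh roughly ten units against about half a unit for any other transcript. An F1 value recomputed from rounded precision and recall can differ slightly from a published figure that was computed on unrounded values.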

Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study

Journal of Medical Internet Research ◽

10.2196/17478 ◽

2020 ◽

Vol 22 (8) ◽

pp. e17478 ◽

Cited By ~ 1

Author(s):

Shyam Visweswaran ◽

Jason B Colditz ◽

Patrick O’Halloran ◽

Na-Rae Han ◽

Sanya B Taneja ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Surveillance System ◽

Short Term Memory ◽

Characteristic Curve ◽

Superior Performance ◽

Support Vector ◽

Data Set ◽

Machine Learning Classifiers ◽

Learning Classifiers

Background Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets. Objective This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments. Methods We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance. 
Results LSTM-CNN performed the best with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98) for relevance; all deep learning classifiers, including LSTM-CNN, performed better than the traditional classifiers, with an AUC of 0.99 (95% CI 0.98-0.99) for distinguishing commercial from noncommercial tweets; and BiLSTM performed the best with an AUC of 0.83 (95% CI 0.78-0.89) for provape sentiment. Overall, LSTM-CNN performed the best across all 3 classification tasks. Conclusions We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.

Water Hazard Prediction using Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a4245.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 1451-1457

Keyword(s):

Machine Learning ◽

Life Forms ◽

Support Vector ◽

Data Sets ◽

Data Set ◽

Climate Conditions ◽

Drought Hazard ◽

Vector Machines ◽

Water Hazard ◽

Water Hazards

Water is the most essential need of all life forms. This essential need can also create hazards in the form of water hazards (flood and drought). Catastrophic events such as floods are considered to be caused by extreme weather conditions as well as by changes in global and regional climate. If precautions are not taken beforehand, such events become increasingly difficult to control when they occur. This study aimed to forecast both flood and drought using machine learning (ML). A clear and precise forecast of flood and drought hazards requires a specific, multivariate analysis across the various kinds of data sets. Multivariate analysis means that statistical methods analyze multiple variables concurrently. Within multivariate analysis, ML can provide increasing levels of accuracy, precision, and efficiency by finding patterns in large and varied data sets. Essentially, ML methods automatically acquire knowledge from a data set. This is done through the process of learning, by which the algorithm can generalize beyond the examples given in the training data. ML is attractive for forecasting because it adapts its solution strategies to the features of the data set. This property can be used to predict extremes from highly variable data, as in the case of hazards. This paper proposes methods and a case study on the application of ML algorithms to forecasting the occurrence of water hazards. In particular, the study focuses on the use of support vector machines and artificial neural networks on a multivariate data set related to the water level of lakes in and around Chennai and to rainfall measurements at the lakes.
