SMILE: a feature-based temporal abstraction framework for event-interval sequence classification

Author(s):  
Jonathan Rebane ◽  
Isak Karlsson ◽  
Leon Bornemann ◽  
Panagiotis Papapetrou

In this paper, we study the problem of classification of sequences of temporal intervals. Our main contribution is a novel framework, which we call SMILE, for extracting relevant features from interval sequences to construct classifiers. SMILE introduces the notion of utilizing random temporal abstraction features as a means to capture information pertaining to class-discriminatory events which occur across the span of complete interval sequences. Our empirical evaluation is applied to a wide array of benchmark data sets and fourteen novel data sets for adverse drug event detection. We demonstrate how the introduction of simple sequential features, followed by progressively more complex features, improves classification performance. Importantly, this investigation demonstrates that SMILE significantly improves AUC performance over the current state of the art. The investigation also reveals that the selection of the underlying classification algorithm is important for achieving superior predictive performance, and shows how the number of features influences the performance of our framework.
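As a rough illustration of feature-based classification of event-interval sequences (not the authors' SMILE implementation), the following Python sketch turns each sequence, assumed here to be a list of (label, start, end) triples, into a fixed-length vector of label counts, durations, and coarse pairwise temporal relations that any standard classifier could consume; the input format and all names are assumptions.

```python
# Hypothetical sketch of interval-sequence featurization; not the SMILE code.
from collections import Counter
from itertools import combinations

def relation(a, b):
    """Coarse temporal relation between two labelled intervals."""
    _, s1, e1 = a
    _, s2, e2 = b
    if e1 <= s2:
        return "before"
    if s1 == s2 and e1 == e2:
        return "equal"
    if s2 < e1 and s1 < e2:
        return "overlap"
    return "other"

def raw_features(seq):
    """Count per-label occurrences, total durations, and pairwise relations."""
    counts = Counter()
    for (label, s, e) in seq:
        counts[("count", label)] += 1
        counts[("duration", label)] += e - s
    for a, b in combinations(sorted(seq, key=lambda iv: iv[1]), 2):
        counts[("rel", a[0], relation(a, b), b[0])] += 1
    return counts

# Two toy event-interval sequences with interval labels "A" and "B".
seqs = [[("A", 0, 4), ("B", 2, 6)], [("A", 0, 3), ("B", 5, 9)]]
maps = [raw_features(s) for s in seqs]
vocab = sorted({k for m in maps for k in m})        # shared feature vocabulary
X = [[m.get(f, 0) for f in vocab] for m in maps]    # fixed-length vectors
print(vocab)
print(X)
```

In the framework described above, such temporal abstraction features are drawn at random rather than enumerated exhaustively, which is what keeps the feature space manageable for long sequences.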

Author(s):  
Jozef Zurada

The paper explores the effect of removing/replacing missing values on the classification performance of several models. The original data set, which contains a relatively large number of missing values, comes from the credit scoring context. This data set was not used to build the models, but it was converted to five other data sets with missing values either removed or replaced using different techniques. The models were built and tested on the five data sets. Preliminary computer simulation showed that the models created and tested on the four data sets in which missing values were replaced exhibited significantly better predictive performance than the model built and tested on the data set with missing values removed.
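The comparison described above can be prototyped along the following lines; this is a hedged sketch using synthetic data and a single classifier as stand-ins for the paper's credit-scoring data set and models, so every name and number is illustrative only.

```python
# Minimal sketch (not the paper's exact setup) comparing listwise deletion
# of missing values with simple imputation strategies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X[rng.random(X.shape) < 0.1] = np.nan          # inject ~10% missing values

# Variant 1: remove rows containing any missing value.
keep = ~np.isnan(X).any(axis=1)
scores = {"drop_rows": cross_val_score(LogisticRegression(max_iter=1000),
                                       X[keep], y[keep], cv=5,
                                       scoring="roc_auc").mean()}

# Variants 2-3: replace missing values via mean / most-frequent imputation.
for strategy in ("mean", "most_frequent"):
    model = make_pipeline(SimpleImputer(strategy=strategy),
                          LogisticRegression(max_iter=1000))
    scores[strategy] = cross_val_score(model, X, y, cv=5,
                                       scoring="roc_auc").mean()
print(scores)
```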


Sensors ◽  
2020 ◽  
Vol 20 (23) ◽  
pp. 6823
Author(s):  
Arijit Das ◽  
Indrajit Saha ◽  
Rafał Scherer

In recent years, hyperspectral images (HSIs) have attracted considerable attention in computer vision (CV) due to their wide utility in remote sensing. Unlike images with three or fewer channels, HSIs have a large number of spectral bands. Recent works demonstrate the use of modern deep-learning-based CV techniques like convolutional neural networks (CNNs) for analyzing HSIs. CNNs have receptive fields (RFs) fueled by learnable weights, which are trained to extract useful features from images. In this work, a novel multi-receptive CNN module called GhoMR is proposed for HSI classification. GhoMR utilizes blocks containing several RFs, extracting features in a residual fashion. Each RF extracts features which are used by other RFs to extract more complex features in a hierarchical manner. However, the higher the number of RFs, the greater the number of associated weights, and thus the heavier the network. Most complex architectures suffer from this shortcoming. To tackle this, the recently introduced Ghost module is used as the basic building unit. Ghost modules address the feature redundancy in CNNs by extracting only limited features and performing cheap transformations on them, thus reducing the overall parameters in the network. To test the discriminative potential of GhoMR, a simple network called GhoMR-Net is constructed using GhoMR modules, and experiments are performed on three public HSI data sets—Indian Pines, University of Pavia, and Salinas Scene. The classification performance is measured using three metrics—overall accuracy (OA), Kappa coefficient (Kappa), and average accuracy (AA). Comparisons with ten state-of-the-art architectures are shown to further demonstrate the effectiveness of the method. Although lightweight, the proposed GhoMR-Net provides comparable or better performance than other networks. The PyTorch code for this study is made available at the iamarijit/GhoMR GitHub repository.
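A simplified PyTorch sketch of a Ghost-style building unit is shown below; it captures the idea of computing a few primary feature maps with a regular convolution and generating the remaining maps with cheap grouped transformations, but it is not the exact GhoMR block, and the layer sizes are assumptions.

```python
# Simplified Ghost-style unit: a costly 1x1 conv produces "primary" maps and
# a cheap depthwise conv derives the remaining maps from them. Illustrative
# only; not the authors' GhoMR module.
import torch
import torch.nn as nn

class GhostUnit(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio          # maps from the regular convolution
        cheap_ch = out_ch - primary_ch        # maps from cheap transformations
        # Assumes cheap_ch is divisible by primary_ch (true for the defaults).
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, cheap_ch, kernel_size=cheap_kernel,
                      padding=cheap_kernel // 2, groups=primary_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)

# E.g. a batch of 16-band hyperspectral patches: (batch, bands, height, width).
x = torch.randn(4, 16, 9, 9)
print(GhostUnit(16, 32)(x).shape)             # -> torch.Size([4, 32, 9, 9])
```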


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 621
Author(s):  
Giuseppe Psaila ◽  
Paolo Fosci

Internet technology and mobile technology have enabled the production and diffusion of massive data sets concerning almost every aspect of day-to-day life. Remarkable examples are social media and apps for volunteered information production, as well as Open Data portals on which public administrations publish authoritative and (often) geo-referenced data sets. In this context, JSON has become the most popular standard for representing and exchanging possibly geo-referenced data sets over the Internet. Analysts, wishing to manage, integrate and cross-analyze such data sets, need a framework that allows them to access possibly remote storage systems for JSON data sets, to retrieve and query data sets by means of a unique query language (independent of the specific storage technology), and to exploit possibly remote computational resources (such as cloud servers), comfortably working on their PC in their office, more or less unaware of the real location of resources. In this paper, we present the current state of the J-CO Framework, a platform-independent and analyst-oriented software framework to manipulate and cross-analyze possibly geo-tagged JSON data sets. The paper presents the general approach behind the J-CO Framework by illustrating the query language by means of a simple, yet non-trivial, example of geographical cross-analysis. The paper also presents the novel features introduced by the re-engineered version of the execution engine and the most recent components, i.e., the storage service for large single JSON documents and the user interface that allows analysts to comfortably share data sets and computational resources with other analysts possibly working in different parts of the globe. Finally, the paper reports the results of an experimental campaign, which show that the execution engine performs in a more than satisfactory way, demonstrating that our framework can indeed be used by analysts to process JSON data sets.
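For readers unfamiliar with the kind of geographical cross-analysis mentioned above, the following plain-Python sketch (deliberately not written in the J-CO query language, whose syntax is not reproduced here) joins two hypothetical geo-referenced JSON collections by spatial containment; the file names and property keys are invented for illustration.

```python
# Generic sketch of cross-analyzing two geo-tagged JSON collections; the
# file names and the "name" property are hypothetical placeholders.
import json
from shapely.geometry import shape

with open("districts.geojson") as f:     # polygons from an Open Data portal
    districts = json.load(f)["features"]
with open("reports.geojson") as f:       # points from a volunteered-data app
    reports = json.load(f)["features"]

# Count the point features falling inside each polygon feature.
counts = {}
for d in districts:
    poly = shape(d["geometry"])
    name = d["properties"].get("name", "unknown")
    counts[name] = sum(poly.contains(shape(r["geometry"])) for r in reports)
print(counts)
```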


2021 ◽  
Vol 11 (2) ◽  
pp. 796
Author(s):  
Alhanoof Althnian ◽  
Duaa AlSaeed ◽  
Heyam Al-Baity ◽  
Amani Samha ◽  
Alanoud Bin Dris ◽  
...  

Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely used models in the medical field, including support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model that is robust to a limited dataset does not necessarily provide the best performance compared to other models.
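The size-reduction experiment can be sketched as follows; this is a minimal stand-in using a public scikit-learn data set and three of the six model families, not the paper's eighteen UCI data sets or its exact protocol.

```python
# Minimal sketch of a dataset-size reduction experiment: train the same
# classifiers on stratified subsets of decreasing size and compare test AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {"NB": GaussianNB(), "AB": AdaBoostClassifier(random_state=0),
          "RF": RandomForestClassifier(random_state=0)}
for frac in (1.0, 0.5, 0.2, 0.1):                   # size-reduction scenarios
    if frac < 1.0:
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=frac, stratify=y_train, random_state=0)
    else:
        X_sub, y_sub = X_train, y_train
    aucs = {name: roc_auc_score(y_test,
                                m.fit(X_sub, y_sub).predict_proba(X_test)[:, 1])
            for name, m in models.items()}
    print(frac, {k: round(v, 3) for k, v in aucs.items()})
```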


2020 ◽  
Author(s):  
Harith Al-Sahaf ◽  
A Song ◽  
K Neshatian ◽  
Mengjie Zhang

Image classification is a complex but important task, especially in the areas of machine vision and image analysis such as remote sensing and face recognition. One of the challenges in image classification is finding an optimal set of features for a particular task, because the choice of features has a direct impact on classification performance. However, the goodness of a feature is highly problem-dependent and often domain knowledge is required. To address these issues, we introduce a Genetic Programming (GP)-based image classification method, Two-Tier GP, which operates directly on raw pixels rather than features. The first tier in a classifier automatically defines features based on the raw image input, while the second tier makes the classification decision. Compared to conventional feature-based image classification methods, Two-Tier GP achieved better accuracies on a range of different tasks. Furthermore, by using the features defined by the first tier of these Two-Tier GP classifiers, conventional classification methods obtained higher accuracies than when classifying on manually designed features. Analysis of evolved Two-Tier image classifiers shows that genuine features are captured in the programs and that the mechanism for achieving high accuracy can be revealed. The Two-Tier GP method has clear advantages in image classification, such as high accuracy, good interpretability and the removal of the explicit feature extraction process. © 2012 IEEE.
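To make the two-tier structure concrete, the hand-written Python sketch below mimics the shape of an evolved program: tier one aggregates raw pixel regions into features and tier two combines them into a decision. In the actual method both tiers are evolved by GP; the regions, primitives, and threshold here are fixed by hand purely for illustration.

```python
# Hand-written illustration (not an evolved program) of the Two-Tier idea.
import numpy as np

def region_mean(img, r, c, h, w):      # tier-1 style aggregation primitive
    return img[r:r + h, c:c + w].mean()

def region_std(img, r, c, h, w):
    return img[r:r + h, c:c + w].std()

def classify(img):
    # Tier 1: features defined directly on raw pixel regions.
    f1 = region_mean(img, 0, 0, 4, 4)
    f2 = region_std(img, 4, 4, 4, 4)
    # Tier 2: decision made from the tier-1 features.
    return 1 if (f1 - f2) > 0.5 else 0

rng = np.random.default_rng(0)
print(classify(rng.random((8, 8))))    # toy 8x8 "image" of raw pixel values
```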


Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 351
Author(s):  
Wilson Tsakane Mongwe ◽  
Rendani Mbuvha ◽  
Tshilidzi Marwala

Markov chain Monte Carlo (MCMC) techniques are usually used to infer model parameters when closed-form inference is not feasible, with one of the simplest MCMC methods being the random walk Metropolis–Hastings (MH) algorithm. The MH algorithm suffers from random walk behaviour, which results in inefficient exploration of the target posterior distribution. This method has been improved upon, with algorithms such as Metropolis Adjusted Langevin Monte Carlo (MALA) and Hamiltonian Monte Carlo being examples of popular modifications to MH. In this work, we revisit the MH algorithm to reduce the autocorrelations in the generated samples without adding significant computational time. We present (1) the Stochastic Volatility Metropolis–Hastings (SVMH) algorithm, which is based on using a random scaling matrix in the MH proposal, and (2) the Locally Scaled Metropolis–Hastings (LSMH) algorithm, in which the scaling matrix depends on the local geometry of the target distribution. For both these algorithms, the proposal distribution is still a Gaussian centred at the current state. The empirical results show that these minor additions to the MH algorithm significantly improve the effective sample rates and predictive performance over the vanilla MH method. The SVMH algorithm produces similar effective sample sizes to the LSMH method, with SVMH outperforming LSMH on an execution-time-normalised effective sample size basis. The performance of the proposed methods is also compared to MALA and the current state-of-the-art method, the No-U-Turn sampler (NUTS). The analysis is performed using a simulation study based on Neal's funnel and multivariate Gaussian distributions and using real-world data modeled using jump diffusion processes and Bayesian logistic regression. Although both MALA and NUTS outperform the proposed algorithms on an effective sample size basis, the SVMH algorithm has similar or better predictive performance when compared to MALA and NUTS across the various targets. In addition, the SVMH algorithm outperforms the other MCMC algorithms on a normalised effective sample size basis on the jump diffusion datasets. These results indicate the overall usefulness of the proposed algorithms.
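A compact NumPy sketch of the random-scaling idea behind SVMH is shown below; the scaling distribution, step size, and target density are illustrative assumptions rather than the paper's exact specification.

```python
# Random-walk Metropolis-Hastings in which the Gaussian proposal is rescaled
# by a freshly drawn random diagonal matrix at each step (the flavour of
# SVMH); the exact scaling scheme in the paper may differ.
import numpy as np

def log_target(x):                      # example target: standard 2-D Gaussian
    return -0.5 * np.dot(x, x)

def svmh_like_sampler(n_samples, dim=2, base_step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    samples = np.empty((n_samples, dim))
    for i in range(n_samples):
        scale = np.exp(rng.normal(0.0, 1.0, size=dim))   # random per-step scale
        proposal = x + base_step * scale * rng.normal(size=dim)
        # The scale is drawn independently of the current state, so the
        # marginal proposal stays symmetric and the plain Metropolis
        # acceptance ratio remains valid.
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples[i] = x
    return samples

chain = svmh_like_sampler(5000)
print(chain.mean(axis=0), chain.std(axis=0))   # should be near 0 and 1
```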


2019 ◽  
Author(s):  
Zanya Reubenne D. Omadlao ◽  
Nica Magdalena A. Tuguinay ◽  
Ricarido Maglaqui Saturay

A machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District is proposed to address the landslide risk due to the climate and topography of Benguet province. It is intended to improve the decision support system for road management with regard to landslides, as implemented by the Department of Public Works and Highways Benguet First District Engineering Office. Supervised classification was applied to daily rainfall and landslide data for the Benguet First Engineering District covering the years 2014 to 2018 using scikit-learn. Various forms of cumulative rainfall values were used to predict landslide occurrence for a given day. Following typical machine learning workflows, the rainfall-landslide data set was divided into training and testing data sets. Machine learning algorithms such as K-Nearest Neighbors, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression, Random Forest, Decision Tree, and AdaBoost were trained using the training data sets, and the trained models were used to make predictions based on the testing data sets. Predictive performance of the models on the testing data sets was compared using true positive rates, false positive rates, and the area under the Receiver Operating Characteristic curve. Predictive performance of these models was then compared to 1-day cumulative rainfall thresholds commonly used for landslide predictions. Among the machine learning models evaluated, Gaussian Naïve Bayes has the best performance, with a mean false positive rate, true positive rate, and area under the curve of 7%, 76%, and 84%, respectively. It also performs better than the 1-day cumulative rainfall thresholds. This research demonstrates the potential of machine learning for identifying temporal patterns in rainfall-induced landslides using minimal data input -- daily rainfall from a single synoptic station, and highway maintenance records. Such an approach may be tested and applied to similar problems in the field of disaster risk reduction and management.
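The workflow reads roughly as follows in scikit-learn; the input file, column names, and rolling windows are hypothetical placeholders, not the district's actual rainfall or maintenance records.

```python
# Minimal sketch of the described workflow: build cumulative-rainfall
# features, train Gaussian Naive Bayes, and evaluate with ROC/AUC.
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("rainfall_landslides.csv")         # daily records (hypothetical)
for window in (1, 3, 5, 7):                         # cumulative rainfall features
    df[f"rain_{window}d"] = df["rain_mm"].rolling(window, min_periods=1).sum()

X = df[[f"rain_{w}d" for w in (1, 3, 5, 7)]]
y = df["landslide"]                                 # 1 if a landslide occurred

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```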


Author(s):  
O. Majgaonkar ◽  
K. Panchal ◽  
D. Laefer ◽  
M. Stanley ◽  
Y. Zaki

Abstract. Classifying objects within aerial Light Detection and Ranging (LiDAR) data is an essential task to which machine learning (ML) is applied increasingly. ML has been shown to be more effective on LiDAR than imagery for classification, but most efforts have focused on imagery because of the challenges presented by LiDAR data. LiDAR datasets are of higher dimensionality, discontinuous, heterogeneous, spatially incomplete, and often scarce. As such, there has been little examination into the fundamental properties of the training data required for acceptable performance of classification models tailored for LiDAR data. The quantity of training data is one such crucial property, because training on different sizes of data provides insight into a model’s performance with differing data sets. This paper assesses the impact of training data size on the accuracy of PointNet, a widely used ML approach for point cloud classification. Models trained on subsets of ModelNet ranging from 40 to 9,843 objects were validated on a test set of 400 objects. Accuracy improved logarithmically, decelerating from 45 objects onwards and slowing significantly at a training size of 2,000 objects, corresponding to 20,000,000 points. This work contributes to the theoretical foundation for development of LiDAR-focused models by establishing a learning curve, suggesting the minimum quantity of manually labelled data necessary for satisfactory classification performance, and providing a path for further analysis of the effects of modifying training data characteristics.
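The experimental pattern (a learning curve over training-set size) can be reproduced schematically as below, with a generic classifier and synthetic data standing in for PointNet and ModelNet; the subset sizes echo those in the text, but the numbers produced are purely illustrative.

```python
# Schematic learning-curve experiment: train on increasingly large subsets of
# the training data and record accuracy on a fixed test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=30, n_informative=15,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=400,
                                          stratify=y, random_state=0)

for n in (40, 100, 400, 1000, 2000, len(X_tr)):     # training-set sizes
    idx = np.random.default_rng(0).choice(len(X_tr), size=n, replace=False)
    acc = RandomForestClassifier(random_state=0).fit(
        X_tr[idx], y_tr[idx]).score(X_te, y_te)
    print(f"{n:>5} training objects -> accuracy {acc:.3f}")
```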


Author(s):  
S. N. Trika ◽  
P. Banerjee ◽  
R. L. Kashyap

Abstract A virtual reality (VR) interface to a feature-based computer-aided design (CAD) system promises to provide a simple interface to a designer of mechanical parts, because it allows intuitive specification of design features such as holes, slots, and protrusions in three dimensions. Given the current state of a part design, the designer is free to navigate around the part and in part cavities to specify the next feature. This method of feature specification also provides directives to the process planner regarding the order in which the features may be manufactured. In iterative feature-based design, the existing part cavities represent constraints as to where the designer is allowed to navigate and place the new feature. The CAD system must be able to recognize the part cavities and enforce these constraints. Furthermore, the CAD system must be able to update its knowledge of part cavities when the new feature is added. In this paper, we (i) show how the CAD system can enforce the aforementioned constraints by exploiting knowledge of part cavities and their adjacencies, and (ii) present efficient methods for updating the set of part cavities when the designer adds a new feature.
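A greatly simplified sketch of the bookkeeping described above follows; cavities are modelled as axis-aligned boxes and adjacency as box contact, which is far coarser than the paper's geometric reasoning, so the class and its placement rule should be read as an assumption-laden illustration only.

```python
# Toy model of maintaining part cavities and their adjacencies as features
# are added; not the paper's geometric algorithms.
def boxes_touch(a, b):
    """True if two axis-aligned boxes (min_xyz, max_xyz) touch or overlap."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(3))

class CavityModel:
    def __init__(self):
        self.cavities = []                 # list of boxes
        self.adjacent = set()              # pairs of cavity indices

    def can_place(self, new_box):
        """Toy constraint: a new feature must reach an existing cavity
        (the first feature is assumed to start from the part exterior)."""
        return not self.cavities or any(boxes_touch(new_box, c)
                                        for c in self.cavities)

    def add_feature(self, new_box):
        if not self.can_place(new_box):
            raise ValueError("feature not reachable from existing cavities")
        self.cavities.append(new_box)
        k = len(self.cavities) - 1
        for i, c in enumerate(self.cavities[:-1]):
            if boxes_touch(new_box, c):
                self.adjacent.add((i, k))  # incremental adjacency update

model = CavityModel()
model.add_feature(((0, 0, 0), (2, 2, 1)))          # a slot
model.add_feature(((1, 1, 1), (2, 2, 3)))          # a hole reaching the slot
print(model.adjacent)                              # {(0, 1)}
```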


2021 ◽  
Vol 14 (11) ◽  
pp. 2369-2382
Author(s):  
Monica Chiosa ◽  
Thomas B. Preußer ◽  
Gustavo Alonso

Data analysts often need to characterize a data stream as a first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter to characterize the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the increasing adoption of hardware accelerators, this paper proposes SKT, an FPGA-based accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams coming either from PCIe or TCP/IP, and it is built to fit emerging cloud service architectures, such as Microsoft's Catapult or Amazon's AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. The empirical evaluation shows how SKT on an FPGA offers a significant performance gain over high-end, server-class CPUs.
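As a software analogy of the single-pass design (not the FPGA implementation), the sketch below maintains basic statistics, an AMS-style second-moment estimate, and a crude distinct-count estimate in one pass over a stream; the hash construction and constants are simplified assumptions.

```python
# One pass over a stream computing basic stats plus two sketches; the hashing
# and estimators are simplified for illustration.
import hashlib

def h(item, salt):
    """Deterministic 64-bit hash of (salt, item)."""
    data = f"{salt}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def one_pass_sketches(stream, n_ams=64, n_hll=64):
    stats = {"count": 0, "sum": 0.0, "min": None, "max": None}
    ams = [0] * n_ams          # +/-1 counters for the second frequency moment
    registers = [0] * n_hll    # rank registers (n_hll must be a power of two)
    for x in stream:
        stats["count"] += 1
        stats["sum"] += x
        stats["min"] = x if stats["min"] is None else min(stats["min"], x)
        stats["max"] = x if stats["max"] is None else max(stats["max"], x)
        for j in range(n_ams):                       # AMS second-moment sketch
            ams[j] += 1 if h(x, j) & 1 else -1
        v = h(x, "card")
        bucket = v & (n_hll - 1)                     # low bits pick the register
        w = v >> 6                                   # remaining bits
        rank = (w & -w).bit_length() if w else 1     # position of lowest set bit
        registers[bucket] = max(registers[bucket], rank)
    stats["avg"] = stats["sum"] / max(stats["count"], 1)
    stats["F2_estimate"] = sum(c * c for c in ams) / n_ams
    # Crude harmonic-mean cardinality estimate; no small-range correction.
    stats["distinct_estimate"] = (0.709 * n_hll * n_hll /
                                  sum(2.0 ** -r for r in registers))
    return stats

print(one_pass_sketches([1, 2, 2, 3, 3, 3, 4, 5] * 10))
```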

