Learning Realistic Patterns from Visually Unrealistic Stimuli: Generalization and Data Anonymization

2021 ◽  
Vol 72 ◽  
pp. 1163-1214
Author(s):  
Konstantinos Nikolaidis ◽  
Stein Kristiansen ◽  
Thomas Plagemann ◽  
Vera Goebel ◽  
Knut Liestøl ◽  
...  

Good training data is a prerequisite for developing useful Machine Learning applications. However, in many domains existing data sets cannot be shared due to privacy regulations (e.g., data from medical studies). This work investigates a simple yet unconventional approach to anonymized data synthesis that enables third parties to benefit from such data. We explore the feasibility of learning implicitly from visually unrealistic, task-relevant stimuli, which are synthesized by exciting the neurons of a trained deep neural network; neuronal excitation thus serves to generate synthetic stimuli. These stimuli are then used to train new classification models. Furthermore, we extend this framework to inhibit representations associated with specific individuals. We use sleep monitoring data from both an open and a large closed clinical study, together with electroencephalogram sleep stage classification data, to evaluate whether (1) end-users can create and successfully use customized classification models, and (2) the identity of study participants is protected. An extensive comparative empirical investigation shows that different algorithms trained on the stimuli are able to generalize successfully on the same task as the original model. Architectural and algorithmic similarity between the new and original models plays an important role in performance. For similar architectures, performance is close to that obtained with the original data (e.g., accuracy differences of 0.56%-3.82%, Kappa coefficient differences of 0.02-0.08). Further experiments show that the stimuli can provide state-of-the-art resilience against adversarial association and membership inference attacks.
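
The core idea of exciting a trained network's neurons can be illustrated with a minimal, hypothetical sketch (not the authors' code): the "network" is reduced to a single ReLU unit with fixed weights `W`, and gradient ascent on the input synthesizes a stimulus that strongly excites that unit without resembling any real training example.

```python
import random

# Hypothetical trained "network": a single ReLU unit with fixed weights W.
W = [0.5, -0.2, 0.8]

def activation(x):
    """ReLU response of the unit for input x."""
    return max(0.0, sum(w * xi for w, xi in zip(W, x)))

def synthesize_stimulus(steps=100, lr=0.1, seed=0):
    """Gradient ascent on the input. For an active ReLU unit the gradient
    of the activation w.r.t. the input is simply W, so each step moves the
    input along W; a small leaky slope escapes the inactive region."""
    rng = random.Random(seed)
    x = [rng.uniform(-0.1, 0.1) for _ in W]
    for _ in range(steps):
        pre = sum(w * xi for w, xi in zip(W, x))
        g = 1.0 if pre > 0 else 0.01  # leaky slope when the unit is inactive
        x = [xi + lr * g * w for xi, w in zip(x, W)]
    return x

stimulus = synthesize_stimulus()
```

In a real deep network, the same loop would backpropagate the chosen neuron's activation down to the input pixels; the resulting stimuli are the visually unrealistic training data described above.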

Author(s):  
O. Majgaonkar ◽  
K. Panchal ◽  
D. Laefer ◽  
M. Stanley ◽  
Y. Zaki

Abstract. Classifying objects within aerial Light Detection and Ranging (LiDAR) data is an essential task to which machine learning (ML) is increasingly applied. ML has been shown to be more effective on LiDAR data than on imagery for classification, but most efforts have focused on imagery because of the challenges presented by LiDAR data. LiDAR datasets are of higher dimensionality, discontinuous, heterogeneous, spatially incomplete, and often scarce. As such, there has been little examination of the fundamental properties of the training data required for acceptable performance of classification models tailored to LiDAR data. The quantity of training data is one such crucial property, because training on different data sizes provides insight into how a model's performance scales. This paper assesses the impact of training data size on the accuracy of PointNet, a widely used ML approach for point cloud classification. Subsets of ModelNet ranging from 40 to 9,843 objects were validated on a test set of 400 objects. Accuracy improved logarithmically: it began decelerating from 45 objects onwards and slowed significantly at a training size of 2,000 objects, corresponding to 20,000,000 points. This work contributes to the theoretical foundation for the development of LiDAR-focused models by establishing a learning curve, suggesting the minimum quantity of manually labelled data necessary for satisfactory classification performance, and providing a path for further analysis of the effects of modifying training data characteristics.
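
The logarithmic learning curve described above can be sketched with a simple closed-form least-squares fit of accuracy ≈ a + b·ln(n). The (size, accuracy) pairs below are made-up illustrative numbers, not the paper's measurements.

```python
import math

# Illustrative (training size, accuracy) pairs following a logarithmic
# trend like the one reported; these values are invented for the sketch.
observations = [(40, 0.55), (200, 0.68), (1000, 0.78), (2000, 0.82), (9843, 0.84)]

def fit_log_curve(points):
    """Least-squares fit of accuracy ≈ a + b * ln(n) (simple linear
    regression after taking the log of the training size)."""
    xs = [math.log(n) for n, _ in points]
    ys = [acc for _, acc in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

a, b = fit_log_curve(observations)
predicted_at_2000 = a + b * math.log(2000)
```

A fit like this is what lets a study extrapolate the minimum labelled-data quantity needed for a target accuracy.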


2020 ◽  
Vol 492 (1) ◽  
pp. 1421-1431 ◽  
Author(s):  
Zhicheng Yang ◽  
Ce Yu ◽  
Jian Xiao ◽  
Bo Zhang

ABSTRACT Radio frequency interference (RFI) detection and excision are key steps in the data-processing pipeline of the Five-hundred-meter Aperture Spherical radio Telescope (FAST). Because of its high sensitivity and large data rate, FAST requires more accurate and efficient RFI flagging methods than its counterparts. In recent decades, approaches based upon artificial intelligence (AI), such as codes using convolutional neural networks (CNNs), have been proposed to identify RFI more reliably and efficiently. However, RFI flagging of FAST data with such methods has often proved erroneous, requiring further manual inspection. In addition, network construction as well as preparation of training data sets for effective RFI flagging has imposed significant additional workloads. Therefore, rapid deployment and adjustment of AI approaches for different observations is impractical with existing algorithms. To overcome these problems, we propose a model called RFI-Net. Given raw data without any preprocessing, RFI-Net detects RFI automatically, producing corresponding masks without any alteration of the original data. Experiments with RFI-Net using simulated astronomical data show that our model outperforms existing methods in terms of both precision and recall. Moreover, compared with other models, our method obtains the same relative accuracy with less training data, thus reducing the effort and time required to prepare the training data set. Further, the training process of RFI-Net can be accelerated, with overfitting minimized, compared with other CNN codes. The performance of RFI-Net has also been evaluated with observational data obtained by FAST and the Bleien Observatory. Our results demonstrate the ability of RFI-Net to accurately identify RFI with fine-grained, high-precision masks that require no further modification.


Web Mining ◽  
2011 ◽  
pp. 189-207
Author(s):  
Lixin Fu

Currently, data classification is performed either on data stored in relational databases or on data stored in flat files. The problem with these approaches is that, for large data sets, they often need multiple scans of the original data and are thus infeasible in many applications. In this chapter we propose to deploy classification on top of OLAP (online analytical processing) and data cube systems. First, we compute the statistics over various combinations of the attributes, known as data cubes. These statistics are then used to derive classification models. In this way, we scan the original data only once, which improves the performance of classification significantly. Furthermore, our new classifier provides "free" classification by eliminating the dominating I/O overhead of scanning the massive original data. An architecture that integrates database, data cube, and data mining is given, and three new cube-based classifiers are presented and evaluated.
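
The single-scan idea can be sketched as follows: one pass over the raw records fills per-(attribute, value, class) count "cubes", and a naive-Bayes-style classifier is then derived from those counts alone, never touching the raw data again. The records and attribute names below are invented for illustration, and this is one possible cube-based classifier, not necessarily the chapter's.

```python
from collections import Counter

# Toy records: (outlook, windy) -> play decision.
records = [
    ("sunny", "no", "yes"),
    ("sunny", "yes", "no"),
    ("rain",  "no", "yes"),
    ("rain",  "yes", "no"),
    ("sunny", "no", "yes"),
]

# Single scan: fill the class counts and the (attribute, value, class) cube.
class_counts = Counter()
cube = Counter()
for *attrs, label in records:
    class_counts[label] += 1
    for i, v in enumerate(attrs):
        cube[(i, v, label)] += 1

def classify(attrs):
    """Derive a prediction from the cube statistics only (no data rescans),
    using naive-Bayes scoring with Laplace smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / total
        for i, v in enumerate(attrs):
            score *= (cube[(i, v, label)] + 1) / (c + 2)
        if score > best_score:
            best, best_score = label, score
    return best
```

Because every query against the cube is a dictionary lookup, classification after the initial scan incurs no further I/O over the original data.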


2021 ◽  
Vol 11 (6) ◽  
pp. 2823
Author(s):  
Francisco Florez-Revuelta

This paper presents a new evolutionary approach, EvoSplit, for the distribution of multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers either divide a data set randomly or use iterative stratification, a method that aims to maintain the label (or label pair) distribution of the original data set in each of the different subsets. Following the same aim, this paper first introduces a single-objective evolutionary approach that tries to obtain a split maximizing the similarity between those distributions independently. Second, a new multi-objective evolutionary algorithm is presented to maximize the similarity considering both distributions (labels and label pairs) simultaneously. Both approaches are validated using well-known multi-label data sets as well as large image data sets currently used in computer vision and machine learning applications. EvoSplit improves the splitting of a data set in comparison to iterative stratification according to different measures: label distribution, label pair distribution, examples distribution, and folds and fold-label pairs with zero positive examples.
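
The fitness measure such an evolutionary search would optimize can be sketched as the distance between the label distributions of two subsets. This is a simplified illustration with an invented toy data set, not EvoSplit's exact objective.

```python
from collections import Counter

# Toy multi-label data set: each example carries a set of labels.
examples = [
    {"cat"}, {"dog"}, {"cat", "dog"}, {"cat"},
    {"dog"}, {"cat", "dog"}, {"cat"}, {"dog"},
]

def label_distribution(subset):
    """Relative frequency of each label within a subset."""
    counts = Counter(l for labels in subset for l in labels)
    total = sum(counts.values())
    return {l: c / total for l, c in counts.items()}

def distribution_distance(split_a, split_b):
    """L1 distance between the label distributions of two subsets; an
    evolutionary split search would minimise a measure like this."""
    da, db = label_distribution(split_a), label_distribution(split_b)
    labels = set(da) | set(db)
    return sum(abs(da.get(l, 0.0) - db.get(l, 0.0)) for l in labels)

train, test = examples[:4], examples[4:]
score = distribution_distance(train, test)
```

An evolutionary algorithm would mutate the assignment of examples to subsets and keep candidates whose distance score decreases; the multi-objective variant would track label-pair distributions with a second such score.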


2021 ◽  
Vol 2089 (1) ◽  
pp. 012050
Author(s):  
Thirupathi Lingala ◽  
C Kishor Kumar Reddy ◽  
B V Ramana Murthy ◽  
Rajashekar Shastry ◽  
YVSS Pragathi

Abstract Data anonymization should support the analysts who intend to use the anonymized data. Releasing datasets that contain personal information requires anonymization that balances privacy concerns while preserving the utility of the data. This work shows how choosing anonymization techniques with the data analyst requirements in mind improves effectiveness quantitatively, by minimizing the discrepancy between querying the original data versus the anonymized result, and qualitatively, by simplifying the workflow for querying the data.
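
The discrepancy measure described above can be made concrete with a tiny invented example: exact ages are generalised into 10-year bands, and an analyst's count query is run against both the original and the anonymized data.

```python
# Toy data: exact ages of eight people (invented for illustration).
people = [23, 31, 35, 42, 48, 55, 61, 67]

def anonymize(age, band=10):
    """Generalise an exact age to the lower bound of its band."""
    return (age // band) * band

# Analyst query: how many people are aged 30-49?
original = sum(1 for a in people if 30 <= a < 50)
anonymized_ages = [anonymize(a) for a in people]
anonymized = sum(1 for a in anonymized_ages if 30 <= a < 50)
discrepancy = abs(original - anonymized)
```

Here the band boundaries align with the query's range, so the discrepancy is zero; choosing a band width that cuts across the analyst's query ranges (e.g., 15-year bands against this query) would introduce error. That is precisely the point of selecting the anonymization technique with the analyst's requirements in mind.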


Author(s):  
Judy Simon

Computer vision, also known as computational visual perception, is a branch of artificial intelligence that allows computers to interpret digital pictures and videos in a manner comparable to biological vision. It entails the development of techniques for simulating biological vision, with the aim of extracting more meaningful information from visual input than biological vision can. Computer vision is exploding due to the avalanche of data being produced today. Powerful generative models, such as Generative Adversarial Networks (GANs), are responsible for significant advances in image generation. The focus of this research is on textual content descriptors in the images used by GANs to generate synthetic data from the MNIST dataset, to either supplement or replace the original data while training classifiers. This can provide better performance than traditional image-augmentation procedures because the synthetic data is handled well. It shows that training classifiers on synthetic data is as effective as training them on pure data alone, and it also reveals that, for small training data sets, supplementing the dataset by first training GANs on the data may lead to a significant increase in classifier performance.


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Mustafa Radha ◽  
Pedro Fonseca ◽  
Arnaud Moreau ◽  
Marco Ross ◽  
Andreas Cerny ◽  
...  

Abstract. Unobtrusive home sleep monitoring using wrist-worn wearable photoplethysmography (PPG) could open the way for better sleep disorder screening and health monitoring. However, PPG is rarely included in large sleep studies with gold-standard sleep annotation from polysomnography, so training data-intensive state-of-the-art deep neural networks is challenging. In this work a deep recurrent neural network is first trained using a large sleep data set with electrocardiogram (ECG) data (292 participants, 584 recordings) to perform 4-class sleep stage classification (wake, rapid-eye-movement, N1/N2, and N3). A small part of its weights is then adapted to a smaller, newer PPG data set (60 healthy participants, 101 recordings) through three variations of transfer learning. The best results (Cohen's kappa of 0.65 ± 0.11, accuracy of 76.36 ± 7.57%) were achieved with the combined domain-and-decision transfer learning strategy, significantly outperforming the PPG-trained and ECG-trained baselines. This performance for PPG-based 4-class sleep stage classification is unprecedented in the literature, bringing home sleep stage monitoring closer to clinical use. The work demonstrates the merit of transfer learning in developing reliable methods for new sensor technologies by reusing similar, older non-wearable data sets. Further study should evaluate our approach in patients with sleep disorders such as insomnia and sleep apnoea.
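
The "adapt a small part of the weights" idea can be sketched abstractly: represent the model as named layers and apply gradient updates only to the layers selected for adaptation, leaving the ECG-pretrained layers frozen. The layer names, weights, and gradients below are invented placeholders, not the paper's architecture.

```python
# Hypothetical model as a dict of layer weights (invented values).
model = {
    "feature_layer": [0.5, -0.3],   # pretrained on ECG, to be frozen
    "domain_layer": [0.1],          # adapted to the PPG domain
    "decision_layer": [0.2],        # adapted decision head
}

def transfer_step(model, gradients, adapt, lr=0.1):
    """One gradient step that updates only the adaptable layers;
    frozen layers are copied through unchanged."""
    updated = {}
    for name, weights in model.items():
        if name in adapt:
            updated[name] = [w - lr * g for w, g in zip(weights, gradients[name])]
        else:
            updated[name] = list(weights)
    return updated

grads = {"feature_layer": [1.0, 1.0], "domain_layer": [1.0], "decision_layer": [1.0]}
new_model = transfer_step(model, grads, adapt={"domain_layer", "decision_layer"})
```

Freezing the pretrained feature layers is what lets the small PPG data set (101 recordings) fine-tune the model without overwriting what was learned from the much larger ECG data set.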


2020 ◽  
Vol 2 (3) ◽  
pp. 327-346
Author(s):  
Christian Limberg ◽  
Heiko Wersing ◽  
Helge Ritter

For incremental machine-learning applications it is often important to robustly estimate the system accuracy during training, especially if humans perform the supervised teaching. Cross-validation and interleaved test/train error are the standard supervised approaches here. We propose a novel semi-supervised accuracy estimation approach that clearly outperforms these two methods. We introduce the Configram Estimation (CGEM) approach to predict the accuracy of any classifier that delivers confidences. By calculating classification confidences for unseen samples, it is possible to train an offline regression model capable of predicting the classifier's accuracy on novel data in a semi-supervised fashion. We evaluate our method with several diverse classifiers and on analytical and real-world benchmark data sets for both incremental and active learning. The results show that our novel method improves accuracy estimation over standard methods and requires less supervised training data after deployment of the model. We demonstrate the application of our approach to a challenging robot object recognition task, where the human teacher can use our method to judge when training is sufficient.
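
The confidence-to-accuracy regression idea can be sketched minimally: summarise each labelled calibration batch by a confidence statistic, fit a regression from that statistic to measured accuracy, then estimate accuracy on unlabelled batches from their confidences alone. This compresses CGEM's configram to a single mean-confidence feature for illustration; the calibration numbers are invented.

```python
def mean_confidence(confidences):
    """One-dimensional stand-in for the full confidence histogram."""
    return sum(confidences) / len(confidences)

# (mean confidence, measured accuracy) from labelled calibration batches
# (invented values for the sketch).
calibration = [(0.60, 0.55), (0.75, 0.70), (0.90, 0.88)]

def fit(points):
    """Closed-form least squares for accuracy ≈ a + b * mean_confidence."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b

a, b = fit(calibration)

def estimate_accuracy(confidences):
    """Semi-supervised estimate: no labels needed for the new batch."""
    return a + b * mean_confidence(confidences)
```

After deployment, only the classifier's confidences on incoming samples are needed, which is why such an estimator reduces the demand for supervised test data.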


1997 ◽  
Vol 36 (04/05) ◽  
pp. 41-46
Author(s):  
A. Kjaer ◽  
W. Jensen ◽  
T. Dyrby ◽  
L. Andreasen ◽  
J. Andersen ◽  
...  

Abstract. A new method for sleep-stage classification using a causal probabilistic network (CPN) as automatic classifier has been implemented and validated. The system uses features from the primary sleep signals from the brain (EEG) and the eyes (AOG) as input. From the EEG, features containing spectral information are derived, which are used to classify power in the classical spectral bands, sleep spindles, and K-complexes. From the AOG, information on rapid eye movements is derived. Features are extracted every 2 seconds. The CPN-based sleep classifier was implemented using the HUGIN system, an application tool for handling causal probabilistic networks. The results obtained using different training approaches show agreements ranging from 68.7% to 70.7% between the system and the two experts when a pooled agreement is computed over the six subjects. As a comparison, the interrater agreement between the two experts, also measured over the six subjects, was found to be 71.4%.
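
The spectral-band features such a system derives from each 2-second EEG epoch can be sketched with a plain DFT (a real implementation would use an FFT). The sampling rate and test tone below are invented for illustration: a pure 10 Hz signal puts its power in the classical alpha band (8-13 Hz), not in delta (0.5-4 Hz).

```python
import math

def band_power(signal, fs, lo, hi):
    """Power of `signal` (sampled at `fs` Hz) in the band [lo, hi) Hz,
    computed by a direct DFT over the in-band frequency bins."""
    n = len(signal)
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if lo <= freq < hi:
            re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            power += (re * re + im * im) / (n * n)
    return power

# One 2-second epoch of a pure 10 Hz tone at fs = 64 Hz (invented numbers).
fs = 64
epoch = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
alpha = band_power(epoch, fs, 8, 13)   # alpha band: 8-13 Hz
delta = band_power(epoch, fs, 0.5, 4)  # delta band: 0.5-4 Hz
```

Band powers computed per epoch like this, alongside spindle and K-complex detectors, form the evidence fed into the probabilistic network.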


2020 ◽  
Vol 10 (5) ◽  
pp. 1797 ◽  
Author(s):  
Mera Kartika Delimayanti ◽  
Bedy Purnama ◽  
Ngoc Giang Nguyen ◽  
Mohammad Reza Faisal ◽  
Kunti Robiatul Mahmudah ◽  
...  

Manual classification of sleep stages is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. Previous work has applied low-dimensional fast Fourier transform (FFT) features together with many machine learning algorithms. In this paper, we demonstrate the utilization of features extracted from EEG signals via FFT to improve the performance of automated sleep stage classification through machine learning methods. Unlike previous works using FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2-6 classes. Using the expanded version of the Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the-art methods. This result indicates that high-dimensional FFT features in combination with a simple feature selection are effective for improving automated sleep stage classification.
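
One form the "simple feature selection" step could take is ranking thousands of FFT features by how strongly their per-class means differ and keeping the top k. The feature matrix below is a tiny invented stand-in, and this criterion is an assumption for illustration, not necessarily the paper's.

```python
# Invented toy feature matrix: rows are epochs, columns are FFT features.
features = [
    [0.9, 0.1, 0.5, 0.2],  # class 0
    [0.8, 0.2, 0.5, 0.1],  # class 0
    [0.1, 0.9, 0.5, 0.2],  # class 1
    [0.2, 0.8, 0.5, 0.1],  # class 1
]
labels = [0, 0, 1, 1]

def select_top_k(X, y, k):
    """Rank features by the absolute difference of their per-class means
    and return the indices of the k most discriminative ones."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        m0 = sum(x[j] for x, l in zip(X, y) if l == 0) / y.count(0)
        m1 = sum(x[j] for x, l in zip(X, y) if l == 1) / y.count(1)
        scores.append((abs(m0 - m1), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

selected = select_top_k(features, labels, k=2)
```

Features 2 and 3 carry no class signal here and are discarded; with thousands of FFT bins, a filter like this keeps the classifier tractable while preserving the discriminative spectral content.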

