Dataset Denoising Based on Manifold Assumption

Learning the knowledge hidden in the manifold-geometric distribution of the dataset is essential for many machine learning algorithms. However, geometric distribution is usually corrupted by noise, especially in the high-dimensional dataset. In this paper, we propose a denoising method to capture the “true” geometric structure of a high-dimensional nonrigid point cloud dataset by a variational approach. Firstly, we improve the Tikhonov model by adding a local structure term to make variational diffusion on the tangent space of the manifold. Then, we define the discrete Laplacian operator by graph theory and get an optimal solution by the Euler–Lagrange equation. Experiments show that our method could remove noise effectively on both synthetic scatter point cloud dataset and real image dataset. Furthermore, as a preprocessing step, our method could improve the robustness of manifold learning and increase the accuracy rate in the classification problem.

Download Full-text

Classification of Brainwaves for Sleep Stages by High-Dimensional FFT Features from EEG Signals

Applied Sciences ◽

10.3390/app10051797 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1797 ◽

Cited By ~ 2

Author(s):

Mera Kartika Delimayanti ◽

Bedy Purnama ◽

Ngoc Giang Nguyen ◽

Mohammad Reza Faisal ◽

Kunti Robiatul Mahmudah ◽

...

Keyword(s):

Machine Learning ◽

Sleep Stage ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Sleep Stages ◽

Eeg Signals ◽

Stage Classification ◽

Sleep Stage Classification ◽

Low Dimensional

Manual classification of sleep stage is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. The previous works have shown that low dimensional fast Fourier transform (FFT) features and many machine learning algorithms have been applied. In this paper, we demonstrate utilization of features extracted from EEG signals via FFT to improve the performance of automated sleep stage classification through machine learning methods. Unlike previous works using FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2–6 classes. Using the expanded version of Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the art methods. This result indicates that high dimensional FFT features in combination with a simple feature selection is effective for the improvement of automated sleep stage classification.

Download Full-text

Competitive Caching with Machine Learned Advice

Journal of the ACM ◽

10.1145/3447579 ◽

2021 ◽

Vol 68 (4) ◽

pp. 1-25

Author(s):

Thodoris Lykouris ◽

Sergei Vassilvitskii

Keyword(s):

Online Algorithms ◽

Empirical Evaluation ◽

Optimal Solution ◽

Poor Performance ◽

Machine Learning Algorithms ◽

Average Error ◽

Generalization Error ◽

Worst Case ◽

Future Events ◽

Real World Datasets

Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work, we develop a framework for augmenting online algorithms with a machine learned predictor to achieve competitive ratios that provably improve upon unconditional worst-case lower bounds when the predictor has low error. Our approach treats the predictor as a complete black box and is not dependent on its inner workings or the exact distribution of its errors. We apply this framework to the traditional caching problem—creating an eviction strategy for a cache of size k . We demonstrate that naively following the oracle’s recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take into account the predictions and prove that this combined approach achieves a competitive ratio that both (i) decreases as the predictor’s error decreases and (ii) is always capped by O (log k ), which can be achieved without any assistance from the predictor. We complement our results with an empirical evaluation of our algorithm on real-world datasets and show that it performs well empirically even when using simple off-the-shelf predictions.

Download Full-text

Classification of unlabeled online media

Scientific Reports ◽

10.1038/s41598-021-85608-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sakthi Kumar Arul Prakash ◽

Conrad Tucker

Keyword(s):

Social Media ◽

Real World ◽

Graphical Model ◽

Ground Truth ◽

Classification Problem ◽

Machine Learning Algorithms ◽

Social Media Networks ◽

Online Social Media ◽

Wide Range

AbstractThis work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e.,media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation when fake and authentic media disseminates across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and with media content. The discovery that the entropy of user–user and user–media interactions approximate fake and authentic media likes, enables us to classify fake media in an unsupervised learning manner.

Download Full-text

Understanding Smartwatch Battery Utilization in the Wild

Sensors ◽

10.3390/s20133784 ◽

2020 ◽

Vol 20 (13) ◽

pp. 3784 ◽

Cited By ~ 1

Author(s):

Morteza Homayounfar ◽

Amirhossein Malekijoo ◽

Aku Visuri ◽

Chelsea Dobbins ◽

Ella Peltonen ◽

...

Keyword(s):

Real World ◽

Binary Classification ◽

Classification Problem ◽

Machine Learning Algorithms ◽

Indexing Method ◽

In The Wild ◽

Art Research ◽

Battery Discharge ◽

Changes Over Time ◽

Application Developers

Smartwatch battery limitations are one of the biggest hurdles to their acceptability in the consumer market. To our knowledge, despite promising studies analyzing smartwatch battery data, there has been little research that has analyzed the battery usage of a diverse set of smartwatches in a real-world setting. To address this challenge, this paper utilizes a smartwatch dataset collected from 832 real-world users, including different smartwatch brands and geographic locations. First, we employ clustering to identify common patterns of smartwatch battery utilization; second, we introduce a transparent low-parameter convolutional neural network model, which allows us to identify the latent patterns of smartwatch battery utilization. Our model converts the battery consumption rate into a binary classification problem; i.e., low and high consumption. Our model has 85.3% accuracy in predicting high battery discharge events, outperforming other machine learning algorithms that have been used in state-of-the-art research. Besides this, it can be used to extract information from filters of our deep learning model, based on learned filters of the feature extractor, which is impossible for other models. Third, we introduce an indexing method that includes a longitudinal study to quantify smartwatch battery quality changes over time. Our novel findings can assist device manufacturers, vendors and application developers, as well as end-users, to improve smartwatch battery utilization.

Download Full-text

Scalable hierarchical clustering by composition rank vector encoding and tree structure

10.1101/2020.04.12.038026 ◽

2020 ◽

Author(s):

Xiao Lai ◽

Pu Tian

Keyword(s):

Machine Learning ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Machine Learning Algorithms ◽

Tree Structure ◽

Supervised Machine Learning ◽

High Dimensional ◽

Rank Vector ◽

Nonlinear Correlations

AbstractSupervised machine learning, especially deep learning based on a wide variety of neural network architectures, have contributed tremendously to fields such as marketing, computer vision and natural language processing. However, development of un-supervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering of high dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility with clustering of protein structural domains. No record comparison, which is an expensive and essential common step to all present clustering algorithms, is involved. Consequently, it achieves linear time and space computational complexity hierarchical clustering, thus applicable to arbitrarily large datasets. The key factor in this algorithm is definition of composition, which is dependent upon physical nature of target data and therefore need to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high dimensional data with strong nonlinear correlations. We hope this algorithm to inspire a rich research field of encoding based clustering well beyond composition rank vector trees.

Download Full-text

A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem

Business Systems Research Journal ◽

10.2478/bsrj-2014-0021 ◽

2014 ◽

Vol 5 (3) ◽

pp. 82-96 ◽

Cited By ~ 3

Author(s):

Marijana Zekić-Sušac ◽

Sanja Pfeifer ◽

Nataša Šarlija

Keyword(s):

Neural Network ◽

Machine Learning ◽

Classification Accuracy ◽

Classification Problem ◽

High Dimensional ◽

Nearest Neighbour ◽

Learning Methods ◽

Machine Learning Methods ◽

Dimensional Classification ◽

Artificial Neural

Abstract Background: Large-dimensional data modelling often relies on variable reduction methods in the pre-processing and in the post-processing stage. However, such a reduction usually provides less information and yields a lower accuracy of the model. Objectives: The aim of this paper is to assess the high-dimensional classification problem of recognizing entrepreneurial intentions of students by machine learning methods. Methods/Approach: Four methods were tested: artificial neural networks, CART classification trees, support vector machines, and k-nearest neighbour on the same dataset in order to compare their efficiency in the sense of classification accuracy. The performance of each method was compared on ten subsamples in a 10-fold cross-validation procedure in order to assess computing sensitivity and specificity of each model. Results: The artificial neural network model based on multilayer perceptron yielded a higher classification rate than the models produced by other methods. The pairwise t-test showed a statistical significance between the artificial neural network and the k-nearest neighbour model, while the difference among other methods was not statistically significant. Conclusions: Tested machine learning methods are able to learn fast and achieve high classification accuracy. However, further advancement can be assured by testing a few additional methodological refinements in machine learning methods.

Download Full-text

Using API Call Sequences for IoT Malware Classification Based on Convolutional Neural Networks

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s021819402140009x ◽

2021 ◽

Vol 31 (04) ◽

pp. 587-612

Author(s):

Qianguang Lin ◽

Ni Li ◽

Qi Qi ◽

Jiabin Hu

Keyword(s):

Time Series ◽

Time Series Data ◽

Comprehensive Evaluation ◽

Optimal Solution ◽

Classification Problem ◽

Experimental Results ◽

Series Data ◽

Chi Square ◽

Malware Classification ◽

Model Training

Internet of Things (IoT) devices built on different processor architectures have increasingly become targets of adversarial attacks. In this paper, we propose an algorithm for the malware classification problem of the IoT domain to deal with the increasingly severe IoT security threats. Application executions are represented by sequences of consecutive API calls. The time series of data is analyzed and filtered based on the improved information gains. It performs more effectively than chi-square statistics, in reducing the sequence lengths of input data meanwhile keeping the important information, according to the experimental results. We use a multi-layer convolutional neural network to classify various types of malwares, which is suitable for processing time series data. When the convolution window slides down the time sequence, it can obtain higher-level positions by collecting different sequence features, thereby understanding the characteristics of the corresponding sequence position. By comparing the iterative efficiency of different optimization algorithms in the model, we select an algorithm that can approximate the optimal solution to a small number of iterations to speed up the convergence of the model training. The experimental results from real world IoT malware sample show that the classification accuracy of this approach can reach more than 98%. Overall, our method has demonstrated practical suitability for IoT malware classification with high accuracies and low computational overheads by undergoing a comprehensive evaluation.

Download Full-text

An Inexact Projected Gradient Method for Sparsity-Constrained Quadratic Measurements Regression

Asia Pacific Journal of Operational Research ◽

10.1142/s0217595919400086 ◽

2019 ◽

Vol 36 (02) ◽

pp. 1940008

Author(s):

Jun Fan ◽

Liqun Wang ◽

Ailing Yan

Keyword(s):

Gradient Method ◽

Least Squares Method ◽

Optimal Solution ◽

High Dimensional ◽

Sparse Signals ◽

Projected Gradient Method ◽

Constrained Least Squares ◽

Projected Gradient ◽

High Dimensional Case ◽

Noisy Measurements

In this paper, we employ the sparsity-constrained least squares method to reconstruct sparse signals from the noisy measurements in high-dimensional case, and derive the existence of the optimal solution under certain conditions. We propose an inexact sparse-projected gradient method for numerical computation and discuss its convergence. Moreover, we present numerical results to demonstrate the efficiency of the proposed method.

Download Full-text

Blind Source Separation for the Aggregation of Machine Learning Algorithms: An Arrhythmia Classification Case

Electronics ◽

10.3390/electronics9030425 ◽

2020 ◽

Vol 9 (3) ◽

pp. 425

Author(s):

Krzysztof Gajowniczek ◽

Iga Grzegorczyk ◽

Michał Gostkowski ◽

Tomasz Ząbkowski

Keyword(s):

Blind Source Separation ◽

Cardiac Arrhythmias ◽

Classification Accuracy ◽

Recursive Partitioning ◽

Source Separation ◽

Classification Problem ◽

Machine Learning Algorithms ◽

Physiological Signals ◽

New Approach ◽

Model Aggregation

In this work, we present an application of the blind source separation (BSS) algorithm to reduce false arrhythmia alarms and to improve the classification accuracy of artificial neural networks (ANNs). The research was focused on a new approach for model aggregation to deal with arrhythmia types that are difficult to predict. The data for analysis consisted of five-minute-long physiological signals (ECG, BP, and PLETH) registered for patients with cardiac arrhythmias. For each patient, the arrhythmia alarm occurred at the end of the signal. The data present a classification problem of whether the alarm is a true one—requiring attention or is false—should not have been generated. It was confirmed that BSS ANNs are able to detect four arrhythmias—asystole, ventricular tachycardia, ventricular fibrillation, and tachycardia—with higher classification accuracy than the benchmarking models, including the ANN, random forest, and recursive partitioning and regression trees. The overall challenge scores were between 63.2 and 90.7.

Download Full-text

Automatic Alignment of Medical Terminologies with General Dictionaries for an Efficient Information Retrieval

Information Retrieval in Biomedicine ◽

10.4018/978-1-60566-274-9.ch005 ◽

2010 ◽

pp. 78-105

Author(s):

Laura Diosan ◽

Alexandrina Rogozan ◽

Jean-Pierre Pécuchet

Keyword(s):

Medical Information ◽

Classification Problem ◽

Machine Learning Algorithms ◽

Support Vector ◽

Svm Classifier ◽

Nearest Neighbour ◽

Automatic Alignment ◽

Alignment Task ◽

Efficient Information ◽

Different Levels

The automatic alignment between a specialized terminology used by librarians in order to index concepts and a general vocabulary employed by a neophyte user in order to retrieve medical information will certainly improve the performances of the search process, this being one of the purposes of the ANR VODEL project. The authors propose an original automatic alignment of definitions taken from different dictionaries that could be associated to the same concept although they may have different labels. The definitions are represented at different levels (lexical, semantic and syntactic), by using an original and shorter representation, which concatenates more similarities measures between definitions, instead of the classical one (as a vector of word occurrence, whose length equals the number of different words from all the dictionaries). The automatic alignment task is considered as a classification problem and three Machine Learning algorithms are utilised in order to solve it: a k Nearest Neighbour algorithm, an Evolutionary Algorithm and a Support Vector Machine algorithm. Numerical results indicate that the syntactic level of nouns seems to be the most important, determining the best performances of the SVM classifier.

Download Full-text