Unsupervised Outlier Detection in Multidimensional Data

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Atiq ur Rehman ◽  
Samir Brahim Belhaouari

Abstract Detecting and removing outliers from a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. To detect anomalies in a dataset in an unsupervised manner, several novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods that consider data compactness and related properties. The new ideas are efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two of the proposed techniques transform the data into a unidimensional distance space to detect outliers, so the techniques remain computationally inexpensive and feasible irrespective of the dimensionality of the data. A comprehensive performance analysis of the proposed anomaly detection schemes is presented, and the new schemes outperform state-of-the-art methods when tested on several benchmark datasets.
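
To make the distance-space idea concrete, here is a minimal sketch in Python, assuming a distance-to-centroid transformation and an IQR-style cutoff; the function name and the cutoff rule are illustrative stand-ins, not the paper's exact statistics.

```python
import numpy as np

def distance_space_outliers(X, k=1.5):
    """Map each point to its distance from the data centroid, giving a
    unidimensional distance vector, then flag points whose distance
    exceeds an IQR-based cutoff (illustrative stand-in for the paper's
    statistical rules)."""
    centroid = X.mean(axis=0)
    d = np.linalg.norm(X - centroid, axis=1)  # one distance per point
    q1, q3 = np.percentile(d, [25, 75])
    return d > q3 + k * (q3 - q1)             # boolean outlier mask
```

Once the distance vector is formed, all subsequent statistics operate on a single value per point, which is why dimensionality does not inflate the cost of the detection step.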


Author(s):  
Hang Li ◽  
Haozheng Wang ◽  
Zhenglu Yang ◽  
Haochen Liu

Network representation is the basis of many applications and of extensive interest in various fields, such as information retrieval, social network analysis, and recommendation systems. Most previous methods for network representation consider only partial aspects of the problem, such as the link structure or node information, or integrate the two incompletely. The present study proposes a deep network representation model that seamlessly integrates the text information and structure of a network. Our model captures highly non-linear relationships between nodes and complex features of a network by exploiting the variational autoencoder (VAE), a deep unsupervised generative algorithm. We then merge the representation learned with a paragraph vector model and the representation learned with the VAE to obtain a network representation that preserves both structure and text information. Comprehensive empirical experiments on benchmark datasets show that our model outperforms state-of-the-art techniques by a large margin.
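
As a rough sketch of the fusion step, assuming the two learned representations are merged by concatenation (one plausible reading; the paper's exact merge operator may differ):

```python
import numpy as np

def merge_representations(z_structure, z_text):
    """Fuse a structure embedding (e.g., a VAE code computed from a
    node's adjacency row) with a text embedding (e.g., a paragraph
    vector) by L2-normalizing each part and concatenating, so both
    signals survive in the final node representation."""
    z_structure = z_structure / (np.linalg.norm(z_structure, axis=1, keepdims=True) + 1e-12)
    z_text = z_text / (np.linalg.norm(z_text, axis=1, keepdims=True) + 1e-12)
    return np.concatenate([z_structure, z_text], axis=1)
```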


2021 ◽  
Vol 12 (06) ◽  
pp. 65-76
Author(s):  
Kieran Greer

This paper presents a batch classifier that splits a dataset into tree branches depending on the category type. It improves on the earlier version and fixes a mistake in the earlier paper. Two important changes have been made. The first is to represent each category with a separate classifier. Each classifier then classifies its own subset of data rows, using batch input values to create the centroid, which also represents the category itself. If a classifier contains data from more than one category, however, it needs to create new classifiers for the incorrectly classified data. The second change is therefore to allow the classifier to branch into new layers when there is a split in the data, creating new classifiers there for the data rows that are incorrectly classified. Each layer can therefore branch like a tree, splitting not on distinguishing features but on distinguishing categories. The paper then suggests a further innovation: representing some data columns with fixed value ranges, or bands. When considering features, it is shown that some of the data can be classified directly through fixed value ranges, while the rest must be classified using a classifier technique; this idea allows the paper to discuss a biological analogy with neurons and neuron links. Tests show that the method can successfully classify a diverse set of benchmark datasets with accuracy better than the state of the art.
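
A toy sketch of the branching idea follows; the class and parameters are hypothetical, and the paper's routing and value-band rules are richer than this.

```python
import numpy as np

class CentroidNode:
    """One centroid per category at each layer; rows that the layer
    misclassifies are handed to a child layer, so the structure
    branches by category rather than by feature."""
    def __init__(self, X, y, depth=0, max_depth=5):
        self.child = None
        self.centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        wrong = self.predict(X) != y
        if wrong.any() and depth < max_depth and len(self.centroids) > 1:
            self.child = CentroidNode(X[wrong], y[wrong], depth + 1, max_depth)

    def predict(self, X):
        # Nearest-centroid assignment at this layer only; the full
        # method would additionally route rows through self.child.
        cats = list(self.centroids)
        D = np.stack([np.linalg.norm(X - self.centroids[c], axis=1) for c in cats])
        return np.asarray(cats)[D.argmin(axis=0)]
```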


2020 ◽  
Vol 36 (18) ◽  
pp. 4675-4681 ◽  
Author(s):  
Yuansheng Liu ◽  
Limsoon Wong ◽  
Jinyan Li

Abstract Motivation A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs inside a maximal match, it breaks the maximal match into shorter match segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match that is allowed to contain mutations. Results We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. Availability and implementation https://github.com/yuansliu/memRGC. Supplementary information Supplementary data are available at Bioinformatics online.
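
The seeding step can be pictured roughly as follows; this is a hedged sketch of a coprime double-window k-mer index with arbitrarily chosen window sizes, not the authors' implementation (see the linked repository for that).

```python
def seed_index(seq, k=21, w1=11, w2=13):
    """Record a k-mer whenever its start position falls on either of
    two coprime sampling grids; long regions shared by two genomes are
    then very likely to share sampled seeds, which can be extended into
    maximal matches and, across mismatches, into mutation-containing
    matches (MCMs)."""
    index = {}
    for i in range(len(seq) - k + 1):
        if i % w1 == 0 or i % w2 == 0:
            index.setdefault(seq[i:i + k], []).append(i)
    return index
```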


2020 ◽  
Vol 12 (12) ◽  
pp. 2010 ◽  
Author(s):  
Seyd Teymoor Seydi ◽  
Mahdi Hasanlou ◽  
Meisam Amani

The diversity of change detection (CD) methods and the limitations in generalizing these techniques across different types of remote sensing datasets and study areas have been a challenge for CD applications. Additionally, most CD methods have been implemented in two intensive and time-consuming steps: (a) predicting change areas, and (b) deciding on the predicted areas. In this study, a novel CD framework based on the convolutional neural network (CNN) is proposed to address the aforementioned problems and to considerably improve accuracy. The proposed CNN-based CD network contains three parallel channels: the first and second channels extract deep features from the original first- and second-date imagery, respectively, while the third channel focuses on extracting change deep features by differencing and stacking deep features. Additionally, each channel includes three types of convolution kernels: 1D, 2D, and 3D dilated convolutions. The effectiveness and reliability of the proposed CD method are evaluated using three different types of remote sensing benchmark datasets (i.e., multispectral, hyperspectral, and Polarimetric Synthetic Aperture RADAR (PolSAR)). The resulting CD maps are evaluated both visually and statistically by calculating nine different accuracy indices. Moreover, the results of the proposed method are compared to those of several state-of-the-art CD algorithms. All the results show that the proposed method outperforms the other remote sensing CD techniques. For instance, across different scenarios, the Overall Accuracies (OAs) and Kappa Coefficients (KCs) of the proposed CD method are better than 95.89% and 0.805, respectively, and the Miss Detection (MD) and False Alarm (FA) rates are lower than 12% and 3%, respectively.
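
A minimal PyTorch sketch of the three-parallel-channel layout appears below; the band count, feature width, and kernels are placeholders, and the paper's channels also mix 1D and 3D dilated convolutions that are omitted here.

```python
import torch
import torch.nn as nn

class ThreeChannelCD(nn.Module):
    """Two branches encode the first- and second-date images, a third
    encodes their difference; fused features feed a change/no-change head."""
    def __init__(self, bands=4, feats=16):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(bands, feats, 3, padding=2, dilation=2),  # dilated 2D conv
                nn.ReLU(),
                nn.Conv2d(feats, feats, 3, padding=1),
                nn.ReLU(),
            )
        self.b1, self.b2, self.b3 = branch(), branch(), branch()
        self.head = nn.Conv2d(3 * feats, 2, kernel_size=1)

    def forward(self, x1, x2):
        fused = torch.cat([self.b1(x1), self.b2(x2), self.b3(x2 - x1)], dim=1)
        return self.head(fused)  # per-pixel change logits
```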


Biomolecules ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1337
Author(s):  
Ruiyang Song ◽  
Baixin Cao ◽  
Zhenling Peng ◽  
Christopher J. Oldfield ◽  
Lukasz Kurgan ◽  
...  

Non-synonymous single nucleotide polymorphisms (nsSNPs) may result in pathogenic changes that are associated with human diseases. Accurate prediction of these deleterious nsSNPs is in high demand. The existing predictors of deleterious nsSNPs secure modest levels of predictive performance, leaving room for improvement. We propose a new sequence-based predictor, DMBS, which addresses the need to improve predictive quality. The design of DMBS relies on the observation that deleterious mutations are likely to occur at highly conserved and functionally important positions in the protein sequence. Correspondingly, we introduce two innovative components. First, we improve the estimates of conservation computed from multiple sequence profiles based on two complementary databases and two complementary alignment algorithms. Second, we utilize putative annotations of functional/binding residues produced by two state-of-the-art sequence-based methods. These inputs are processed by a random forest model that provides favorable predictive performance when empirically compared against five other machine-learning algorithms. Empirical results on four benchmark datasets reveal that DMBS achieves AUC > 0.94, outperforming current methods, including protein structure-based approaches. In particular, DMBS secures AUC = 0.97 for the SNPdbe and ExoVar datasets, compared to AUC = 0.70 and 0.88, respectively, obtained by the best available methods. Further tests on the independent HumVar dataset show that our method significantly outperforms the state-of-the-art method SNPdryad. We conclude that DMBS provides accurate predictions that can effectively guide wet-lab experiments in a high-throughput manner.
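
The final classification stage is a standard random forest over conservation and annotation features; a hedged sketch with synthetic stand-in inputs follows (the real features are profile-based conservation estimates and putative binding-residue annotations).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 6))      # stand-in: conservation + annotation features per nsSNP
y = rng.integers(0, 2, 500)   # stand-in labels: 1 = deleterious, 0 = neutral

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # per-mutation deleteriousness scores
```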


2018 ◽  
Author(s):  
Debanjan Mahata ◽  
John Kuriakose ◽  
Rajiv Ratn Shah ◽  
Roger Zimmermann ◽  
John R. Talburt

Keyword extraction is a fundamental task in natural language processing that facilitates the mapping of documents to a concise set of representative single- and multi-word phrases. Keywords are primarily extracted from text documents using supervised and unsupervised approaches. In this paper, we present an unsupervised technique that uses a combination of a theme-weighted personalized PageRank algorithm and neural phrase embeddings for extracting and ranking keywords. We also introduce an efficient way of processing text documents and training phrase embeddings using existing techniques. We share an evaluation dataset, derived from an existing dataset, that is used for choosing the underlying embedding model. The evaluations for ranked keyword extraction are performed on two benchmark datasets comprising short abstracts (Inspec) and long scientific papers (SemEval 2010), and our approach is shown to produce better results than state-of-the-art systems.
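
The ranking core can be approximated by a personalized PageRank over a word co-occurrence graph, as in this sketch; the graph construction and theme weighting are simplified, and theme_weights is a hypothetical input mapping candidate words to thematic relevance.

```python
import networkx as nx

def rank_candidates(cooccurrence_edges, theme_weights):
    """Run PageRank on a weighted co-occurrence graph, biasing the
    restart distribution toward thematically important words, and
    return candidates sorted by score."""
    G = nx.Graph()
    G.add_weighted_edges_from(cooccurrence_edges)  # (word_a, word_b, weight) triples
    scores = nx.pagerank(G, alpha=0.85, personalization=theme_weights, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)
```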


2020 ◽  
Vol 12 (15) ◽  
pp. 2478 ◽  
Author(s):  
Xinxiang Lei ◽  
Wei Chen ◽  
Mohammadtaghi Avand ◽  
Saeid Janizadeh ◽  
Narges Kariminejad ◽  
...  

In the present study, gully erosion susceptibility was evaluated for the Robat Turk Watershed in Iran. The assessment was performed using four state-of-the-art data mining techniques: random forest (RF), credal decision trees (CDTree), kernel logistic regression (KLR), and best-first decision tree (BFTree). To the best of our knowledge, the KLR and CDTree algorithms have rarely been applied to gully erosion modeling. In the first step, of the 242 gully erosion locations that were identified, 70% (170 gullies) were selected as the training dataset, and the remaining 30% (72 gullies) were reserved for validation. In the next step, twelve gully erosion conditioning factors, including topographic, geomorphological, environmental, and hydrologic factors, were selected to estimate gully erosion susceptibility. The area under the ROC curve (AUC) was used to estimate the performance of the models. The results revealed that the RF model had the best performance (AUC = 0.893), followed by the KLR (AUC = 0.825), CDTree (AUC = 0.808), and BFTree (AUC = 0.789) models. Overall, the RF model performed significantly better than the others, which may support the application of this method as a transferable susceptibility model in other areas. Therefore, we suggest using the RF, KLR, and CDTree models for gully erosion susceptibility mapping in other prone areas to assess their reproducibility.
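
The protocol described (70/30 split, AUC validation) is a standard supervised workflow; the sketch below uses placeholder data and a random forest, mirroring the best-performing model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((500, 12))    # placeholder: 12 conditioning factors per location
y = rng.integers(0, 2, 500)  # placeholder gully / non-gully labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```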


2020 ◽  
Vol 14 (2) ◽  
pp. 140-159
Author(s):  
Anthony-Paul Cooper ◽  
Emmanuel Awuni Kolog ◽  
Erkki Sutinen

This article builds on previous research around the exploration of the content of church-related tweets. It does so by exploring whether the qualitative thematic coding of such tweets can, in part, be automated through machine learning. It compares three supervised machine learning algorithms to understand how useful each is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values that each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
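
The pipeline described is a conventional bag-of-words Naïve Bayes setup; a sketch with placeholder tweets and codes follows (the real study uses a human-coded church-related Twitter dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["sunday service was uplifting", "join our youth bible study",
          "bake sale fundraiser this week", "volunteers needed for outreach"]
codes = ["worship", "study", "events", "events"]   # placeholder thematic codes

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(tweets, codes)
pred = clf.predict(tweets)
print(precision_recall_fscore_support(codes, pred, average="macro"))
```

Here the model is scored on its own training data purely for brevity; the study evaluates Precision, Recall, and F-measure on held-out human-coded tweets.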


2020 ◽  
Vol 17 (6) ◽  
pp. 847-856
Author(s):  
Shengbing Ren ◽  
Xiang Zhang

The problem of synthesizing adequate inductive invariants lies at the heart of automated software verification. State-of-the-art machine learning algorithms for synthesizing invariants have gradually shown excellent performance. However, synthesizing disjunctive invariants remains a difficult task. In this paper, we propose k++SVM, a method integrating k-means++ and the Support Vector Machine (SVM) to synthesize conjunctive and disjunctive invariants. Given a program, we start by executing the program to collect program states. Next, k++SVM adopts k-means++ to cluster the positive samples and then applies SVM to distinguish each positive-sample cluster from all negative samples, synthesizing the candidate invariants. Finally, a set of checks founded on Hoare logic is adopted to determine whether the candidate invariants are true invariants. If the candidate invariants fail the check, we sample more states and repeat the algorithm. The experimental results show that k++SVM is compatible with the algorithms for Intersection Of Half-space (IOH) and more efficient than the tool Interproc. Furthermore, our method can synthesize conjunctive and disjunctive invariants automatically.
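
A compact sketch of the k++SVM candidate-generation loop, assuming numeric program states and linear separators; the Hoare-logic check and the resampling loop are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def candidate_invariants(pos, neg, k=3):
    """Cluster positive (reachable) states with k-means++, then fit one
    linear SVM per cluster against all negative states. Each hyperplane
    w.x + b > 0 is a conjunctive piece; their union over clusters forms
    a disjunctive candidate invariant."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(pos)
    pieces = []
    for c in range(k):
        cluster = pos[km.labels_ == c]
        if len(cluster) == 0:
            continue  # skip empty clusters
        X = np.vstack([cluster, neg])
        y = np.hstack([np.ones(len(cluster)), np.zeros(len(neg))])
        svm = LinearSVC(max_iter=5000).fit(X, y)
        pieces.append((svm.coef_[0], svm.intercept_[0]))
    return pieces
```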

