Optimization approach to the choice of explicable methods for detecting anomalies in homogeneous text collections

Author(s):  
Fedor Krasnov ◽  
Irina Smaznevich ◽  
Elena Baskakova

The problem of detecting anomalous documents in text collections is considered. Existing methods for detecting anomalies are not universal and do not show stable results on different data sets. The accuracy of the results depends on the choice of parameters at each step of the solution algorithm, and different sets of parameters are optimal for different collections. Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity.

The problem of finding anomalies is considered in the following statement: a new document uploaded to an applied intelligent information system must be checked for congruence with the homogeneous collection of documents stored in it. In such systems, which process legal documents, the following requirements are imposed on the anomaly detection methods: high accuracy, computational efficiency, reproducibility of results, and explainability of the solution. Methods satisfying these conditions are investigated.

The paper examines the possibility of evaluating text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document with respect to the collection is proposed, based on a justified selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the novelty detection algorithms.

The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the distances of documents to the center of the collection and to the foreign document; and optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms were tested: Isolation Forest, Local Outlier Factor, and One-Class SVM (based on the Support Vector Machine).

The experiment confirmed the effectiveness of the proposed optimization strategy for choosing an appropriate anomaly detection method for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolation Forest method proved effective. When vectorizing documents using TF-IDF, it is advisable to choose optimal dictionary parameters and use the One-Class SVM method with an appropriate feature space transformation function.
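
A minimal sketch of such a check, assuming a scikit-learn stack and a hypothetical toy corpus: TF-IDF vectorization, truncated SVD for dimensionality reduction, and two of the detectors named above applied to a new document. This is not the authors' implementation, and the Hellinger-distance anomaly index is omitted.

```python
# Illustrative sketch only: score a new document against a homogeneous collection.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

collection = [                                  # hypothetical homogeneous corpus
    "railway signalling safety requirements standard",
    "track gauge and rail fastening technical norms",
    "rolling stock braking system requirements",
    "railway electrification clearance standard",
    "signalling interlocking safety requirements",
    "rail welding and inspection technical norms",
]
new_doc = ["information technology security evaluation criteria"]  # foreign document

vec = TfidfVectorizer(sublinear_tf=True)        # dictionary parameters would be tuned
X = vec.fit_transform(collection)
x_new = vec.transform(new_doc)

svd = TruncatedSVD(n_components=3, random_state=0)   # reduce the sparse space
X_r, x_new_r = svd.fit_transform(X), svd.transform(x_new)

for model in (IsolationForest(random_state=0), OneClassSVM(kernel="rbf", nu=0.1)):
    model.fit(X_r)
    # lower decision_function scores mean "more anomalous" for both detectors
    print(type(model).__name__, float(model.decision_function(x_new_r)[0]))
```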

Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3536
Author(s):  
Jakub Górski ◽  
Adam Jabłoński ◽  
Mateusz Heesch ◽  
Michał Dziendzikowski ◽  
Ziemowit Dworakowski

Condition monitoring is an indispensable element of the operation of rotating machinery. In this article, a monitoring system for a parallel gearbox is proposed. A novelty detection approach is used to develop the condition assessment support system, which requires data collected from a healthy structure. The measured signals were processed to extract quantitative indicators sensitive to the type of damage occurring in this type of structure. The indicator values were used to develop four different novelty detection algorithms. The presented novelty detection models operate on three principles: feature space distance, probability distribution, and input reconstruction. One of the distance-based models is adaptive, adjusting to new data arriving as a stream. The authors test the developed algorithms on experimental and simulation data with similar distributions, using a training set consisting mainly of samples generated by the simulator. The results presented in the article demonstrate the effectiveness of the trained models on both data sets.
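
Below is a minimal sketch of one feature-space-distance novelty model of the kind described, assuming synthetic healthy-condition indicators; it is not the authors' system.

```python
# Sketch: learn the centroid and spread of healthy indicators, flag distant samples.
import numpy as np

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(500, 4))       # stand-in for healthy indicators
mean = healthy.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(healthy, rowvar=False))

def mahalanobis(x):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# novelty threshold taken from the healthy training distribution
threshold = np.quantile([mahalanobis(x) for x in healthy], 0.99)

normal_sample = np.array([0.1, -0.2, 0.0, 0.3])      # hypothetical new measurement
damaged_sample = np.array([3.5, 4.0, -3.0, 5.0])     # indicators drifted by damage
print(mahalanobis(normal_sample) > threshold, mahalanobis(damaged_sample) > threshold)
```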


Author(s):  
Xin Wu ◽  
Yaoyu Li

When an air compressor is operated at a very low flow rate for a given discharge pressure, surge may occur, resulting in large oscillations in pressure and flow in the compressor. To prevent damage to the compressor from surge, the control strategy typically employed is to operate it below the surge line (a map of the conditions at which surge begins). The surge line is strongly affected by ambient air conditions. Previous research has derived data-driven surge maps based on an asymmetric support vector machine (ASVM). The ASVM penalizes the surge case with a much greater cost to minimize the possibility of undetected surge. This paper concerns the development of adaptive ASVM-based self-learning surge map modeling, combined with signal processing techniques for surge detection. During the actual operation of a compressor, after the ASVM-based surge map has been obtained from historic data, new surge points can be identified with surge detection methods such as the short-time Fourier transform or the wavelet transform. The new surge points can then be used to update the surge map. However, with an increasing number of surge points, the complexity of the support vector machine (SVM) would grow dramatically. In order to keep the surge map SVM at a relatively low dimension, an adaptive SVM modeling algorithm is developed to select the minimum set of necessary support vectors in a three-dimensional feature space based on Gaussian curvature, so as to guarantee a desirable classification between surge and non-surge areas. The proposed method is validated using surge test data obtained from a testbed compressor at a manufacturing plant.
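
A hedged sketch of the asymmetric-cost idea follows, implemented here simply as a class-weighted SVM trained on synthetic operating points; the paper's adaptive support-vector pruning based on Gaussian curvature is not reproduced.

```python
# Sketch: penalize misclassified surge points far more than normal points.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# synthetic (mass flow, pressure ratio) operating points; label 1 = surge
no_surge = rng.normal([0.7, 1.5], 0.05, size=(200, 2))
surge = rng.normal([0.3, 1.9], 0.05, size=(40, 2))
X = np.vstack([no_surge, surge])
y = np.array([0] * len(no_surge) + [1] * len(surge))

# asymmetric cost: missing a surge point is much more expensive
clf = SVC(kernel="rbf", C=10.0, class_weight={0: 1.0, 1: 50.0})
clf.fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0])
print(clf.predict([[0.35, 1.85], [0.7, 1.5]]))     # near-surge vs. normal point
```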


2015 ◽  
Vol 24 (04) ◽  
pp. 1540016 ◽  
Author(s):  
Muhammad Hussain ◽  
Sahar Qasem ◽  
George Bebis ◽  
Ghulam Muhammad ◽  
Hatim Aboalsamh ◽  
...  

With the maturing of digital image processing techniques, many tools can forge an image easily without leaving visible traces, which leads to the problem of authenticating digital images. Based on the assumption that forgery alters the texture micro-patterns in a digital image and that texture descriptors can be used to model this change, we employed two state-of-the-art local texture descriptors, the multi-scale Weber's law descriptor (multi-WLD) and the multi-scale local binary pattern (multi-LBP), for splicing and copy-move forgery detection. Since tamper traces are not visible to the naked eye, the chrominance components of an image, which encode these traces, were used to model them with the texture descriptors. To reduce the dimension of the feature space and remove redundant features, we employed a locally learning based (LLB) algorithm. A support vector machine (SVM) was used to identify an image as authentic or tampered. This paper presents a thorough investigation for the validation of this forgery detection method. The experiments were conducted on three benchmark image data sets, namely CASIA v1.0, CASIA v2.0, and Columbia color. The experimental results showed that the accuracy rate of the multi-WLD based method was 94.19% on CASIA v1.0, 96.52% on CASIA v2.0, and 94.17% on the Columbia data set. It is not only significantly better than the multi-LBP based method, but also outperforms other state-of-the-art forgery detection methods.
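
As a rough illustration only, the sketch below computes multi-scale LBP histograms on a chrominance channel and trains an SVM on synthetic labels; the multi-WLD descriptor and the LLB feature selection step are omitted, and the data are random stand-ins.

```python
# Sketch: chrominance-channel multi-scale LBP features -> SVM classifier.
import numpy as np
from skimage.color import rgb2ycbcr
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_features(rgb, radii=(1, 2, 3)):
    cb = rgb2ycbcr(rgb)[:, :, 1].astype(np.uint8)   # chrominance (Cb) channel
    feats = []
    for r in radii:                                 # multi-scale LBP codes
        codes = local_binary_pattern(cb, 8 * r, r, method="uniform")
        hist, _ = np.histogram(codes, bins=8 * r + 2, density=True)
        feats.append(hist)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64, 3))                # stand-ins for real images
labels = np.array([0] * 10 + [1] * 10)              # 0 = authentic, 1 = tampered
X = np.array([lbp_features(im) for im in images])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.score(X, labels))
```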


2020 ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

RNA-Seq data are utilized for biological applications and decision making in the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space, and a fast feature-ranking algorithm is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy against existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
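
A minimal sketch of the reduction-plus-classification part follows, using a plain PCA-to-KNN pipeline on synthetic expression data; the optimized genetic-algorithm and ICA components of the proposed method are not reproduced.

```python
# Sketch: dimensionality reduction followed by KNN classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))                 # stand-in for an expression matrix
y = np.array([0] * 30 + [1] * 30)               # two hypothetical classes

model = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())
```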


2021 ◽  
Vol 50 (9) ◽  
pp. 2579-2589
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi

RNA-Seq data are utilized for biological applications and decision making in the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for extracting relevant information from the data. In this study, a novel optimized dimensionality reduction algorithm is proposed that combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses a Decision Tree on the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space; feature ranking and prior experience are used. The performance of the model is evaluated and validated using classification accuracy against existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.


2020 ◽  
Vol 13 (1) ◽  
pp. 148-151
Author(s):  
Kristóf Muhi ◽  
Zsolt Csaba Johanyák

In most cases, a dataset obtained through observation, measurement, etc. cannot be directly used for the training of a machine learning based system due to the unavoidable presence of missing data, inconsistencies, and a high-dimensional feature space. Additionally, the individual features can have quite different data types and ranges. For this reason, a data preprocessing step is nearly always necessary before the data can be used. This paper gives a short review of the typical methods applicable in the preprocessing and dimensionality reduction of raw data.
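
A short illustrative sketch of such a preprocessing chain, assuming scikit-learn and a toy array with missing entries and differing ranges, is given below.

```python
# Sketch: impute missing values, scale ranges, then reduce dimensionality.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0, np.nan],
              [2.0, np.nan, 0.3],
              [1.5, 180.0, 0.2],
              [3.0, 220.0, 0.9]])               # toy raw data with missing entries

pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     StandardScaler(),
                     PCA(n_components=2))
print(pipe.fit_transform(X))
```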


2021 ◽  
Author(s):  
Antonio Candelieri ◽  
Riccardo Perego ◽  
Francesco Archetti

Searching for accurate machine and deep learning models is a computationally expensive and highly energy-intensive process. A strategy that has recently been gaining importance for drastically reducing computational time and energy consumption is to exploit the availability of different information sources with different computational costs and different "fidelity," typically smaller portions of a large dataset. This multi-source optimization strategy fits into the scheme of Gaussian Process-based Bayesian Optimization. An Augmented Gaussian Process method exploiting multiple information sources (namely, AGP-MISO) is proposed. The Augmented Gaussian Process is trained using only "reliable" information among the available sources, and a novel acquisition function is defined according to the Augmented Gaussian Process. Computational results are reported for the optimization of the hyperparameters of a Support Vector Machine (SVM) classifier using two sources: a large dataset (the most expensive one) and a smaller portion of it. A comparison with a traditional Bayesian Optimization approach that optimizes the hyperparameters of the SVM classifier on the large dataset only is reported.
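
For orientation only, the sketch below runs a plain single-source GP-based Bayesian optimization loop over one SVM hyperparameter (log10 C); it does not implement the AGP-MISO augmentation or the multi-source acquisition function described in the paper.

```python
# Sketch: GP-based Bayesian optimization of an SVM regularization parameter.
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_breast_cancer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(log_c):
    # "expensive" source: cross-validated accuracy on the full dataset
    return cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=3).mean()

grid = np.linspace(-3, 3, 61)
tried = [-2.0, 0.0, 2.0]
scores = [objective(c) for c in tried]

for _ in range(5):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mu, sigma = gp.predict(grid.reshape(-1, 1), return_std=True)
    best = max(scores)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    nxt = float(grid[int(np.argmax(ei))])
    tried.append(nxt)
    scores.append(objective(nxt))

print("best log10(C):", tried[int(np.argmax(scores))], "CV accuracy:", max(scores))
```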


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

RNA-Seq data are utilized for biological applications and decision making in the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space, and a fast feature-ranking algorithm is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy against existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is capable of adding to prevailing machine learning methods.


Author(s):  
Paul Hayton ◽  
Simukai Utete ◽  
Dennis King ◽  
Steve King ◽  
Paul Anuzis ◽  
...  

Novelty detection requires models of normality to be learnt from training data known to be normal. The first model considered in this paper is a static model trained to detect novel events associated with changes in the vibration spectra recorded from a jet engine. We describe how the distribution of energy across the harmonics of a rotating shaft can be learnt by a support vector machine model of normality. The second model is a dynamic model partially learnt from data using an expectation–maximization-based method. This model uses a Kalman filter to fuse performance data in order to characterize normal engine behaviour. Deviations from normal operation are detected using the normalized innovations squared from the Kalman filter.
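
The sketch below illustrates the normalized innovations squared (NIS) check on a one-dimensional synthetic signal with a simple constant-state Kalman filter; the EM-learnt engine model and the performance-data fusion of the paper are not reproduced.

```python
# Sketch: flag novelty when the Kalman filter's NIS exceeds a chi-square bound.
import numpy as np

rng = np.random.default_rng(0)
q, r = 1e-4, 0.05                       # assumed process / measurement noise
x_hat, p = 0.0, 1.0                     # state estimate and its variance
threshold = 3.84                        # chi-square 95% bound, 1 degree of freedom

measurements = list(rng.normal(0.0, np.sqrt(r), 60)) + [2.5, 2.6, 2.7]  # fault at end
for k, z in enumerate(measurements):
    p += q                              # predict (state assumed constant)
    s = p + r                           # innovation covariance
    nu = z - x_hat                      # innovation
    nis = nu ** 2 / s                   # normalized innovations squared
    if nis > threshold:
        print(f"novelty flagged at step {k}: NIS={nis:.1f}")
    k_gain = p / s                      # Kalman gain and measurement update
    x_hat += k_gain * nu
    p *= (1 - k_gain)
```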


Sensors ◽  
2022 ◽  
Vol 22 (1) ◽  
pp. 367
Author(s):  
Janez Lapajne ◽  
Matej Knapič ◽  
Uroš Žibrat

Hyperspectral imaging is a popular tool for non-invasive plant disease detection. Data acquired with it usually consist of many correlated features; hence most of the acquired information is redundant. Dimensionality reduction methods are used to transform the data sets from high-dimensional to low-dimensional (in this study, to one or a few features). We chose six dimensionality reduction methods (partial least squares, linear discriminant analysis, principal component analysis, Random Forest, ReliefF, and extreme gradient boosting) and tested their efficacy on a hyperspectral data set of potato tubers. The extracted or selected features were pipelined to a support vector machine classifier and evaluated. Tubers were divided into two groups, healthy and infested with Meloidogyne luci. The results show that all dimensionality reduction methods enabled successful identification of inoculated tubers. The best and most consistent results were obtained using linear discriminant analysis, with 100% accuracy on both the inside and outside images of potato tubers. Classification success was generally higher in the outside data set than in the inside one. Nevertheless, accuracy was above 0.6 in all cases.
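
As an illustration of the shape of the best-performing pipeline (not the study's code), the sketch below reduces synthetic correlated "bands" with linear discriminant analysis and classifies with an SVM.

```python
# Sketch: LDA reduction of spectral bands -> SVM classification.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
bands = rng.normal(size=(120, 60))          # stand-in for 60 spectral bands
labels = np.array([0] * 60 + [1] * 60)      # 0 = healthy, 1 = infested (synthetic)
bands[labels == 1, 20:30] += 0.8            # inject a spectral difference for class 1

# reduce to a single discriminant feature, then classify with an SVM
model = make_pipeline(LinearDiscriminantAnalysis(n_components=1), SVC(kernel="rbf"))
print(cross_val_score(model, bands, labels, cv=5).mean())
```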

