Optimization approach to the choice of explicable methods for detecting anomalies in homogeneous text collections

Author(s):  
Fedor Krasnov ◽  
Irina Smaznevich ◽  
Elena Baskakova

The problem of detecting anomalous documents in text collections is considered. Existing methods for detecting anomalies are not universal and do not show stable results on different data sets. The accuracy of the results depends on the choice of parameters at each step of the solution algorithm, and different sets of parameters are optimal for different collections. Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity.

The problem of finding anomalies is considered in the following statement: a new document uploaded to an applied intelligent information system must be checked for congruence with the homogeneous collection of documents stored in it. In such systems, which process legal documents, the following requirements are imposed on the anomaly detection methods: high accuracy, computational efficiency, reproducibility of results, and explainability of the solution. Methods satisfying these conditions are investigated.

The paper examines the possibility of evaluating text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document with respect to the collection is proposed, based on a justified selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the novelty detection algorithms.

The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the distances of documents to the center of the collection and to the foreign document; and optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms were tested: Isolation Forest, Local Outlier Factor, and One-Class SVM (based on the Support Vector Machine).

The experiment confirmed the effectiveness of the proposed optimization strategy for choosing an appropriate anomaly detection method for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolation Forest method proved effective. When vectorizing documents using TF-IDF, it is advisable to choose optimal dictionary parameters and use the One-Class SVM method with an appropriate feature space transformation function.
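
A minimal sketch of such a check, assuming a scikit-learn stack and a hypothetical toy corpus: TF-IDF vectorization, truncated SVD for dimensionality reduction, and two of the detectors named above applied to a new document. This is not the authors' implementation, and the Hellinger-distance anomaly index is omitted.

```python
# Illustrative sketch only: score a new document against a homogeneous collection.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

collection = [                                  # hypothetical homogeneous corpus
    "railway signalling safety requirements standard",
    "track gauge and rail fastening technical norms",
    "rolling stock braking system requirements",
    "railway electrification clearance standard",
    "signalling interlocking safety requirements",
    "rail welding and inspection technical norms",
]
new_doc = ["information technology security evaluation criteria"]  # foreign document

vec = TfidfVectorizer(sublinear_tf=True)        # dictionary parameters would be tuned
X = vec.fit_transform(collection)
x_new = vec.transform(new_doc)

svd = TruncatedSVD(n_components=3, random_state=0)   # reduce the sparse space
X_r, x_new_r = svd.fit_transform(X), svd.transform(x_new)

for model in (IsolationForest(random_state=0), OneClassSVM(kernel="rbf", nu=0.1)):
    model.fit(X_r)
    # lower decision_function scores mean "more anomalous" for both detectors
    print(type(model).__name__, float(model.decision_function(x_new_r)[0]))
```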

Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3536
Author(s):  
Jakub Górski ◽  
Adam Jabłoński ◽  
Mateusz Heesch ◽  
Michał Dziendzikowski ◽  
Ziemowit Dworakowski

Condition monitoring is an indispensable element of the operation of rotating machinery. In this article, a monitoring system for a parallel gearbox is proposed. A novelty detection approach is used to develop the condition assessment support system, which requires data collected from a healthy structure. The measured signals were processed to extract quantitative indicators sensitive to the type of damage occurring in this type of structure. The indicator values were used to develop four different novelty detection algorithms. The presented novelty detection models operate on three principles: feature space distance, probability distribution, and input reconstruction. One of the distance-based models is adaptive, adjusting to new data arriving as a stream. The authors test the developed algorithms on experimental and simulation data with similar distributions, using a training set consisting mainly of samples generated by the simulator. The results presented in the article demonstrate the effectiveness of the trained models on both data sets.
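
Below is a minimal sketch of one feature-space-distance novelty model of the kind described, assuming synthetic healthy-condition indicators; it is not the authors' system.

```python
# Sketch: learn the centroid and spread of healthy indicators, flag distant samples.
import numpy as np

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(500, 4))       # stand-in for healthy indicators
mean = healthy.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(healthy, rowvar=False))

def mahalanobis(x):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# novelty threshold taken from the healthy training distribution
threshold = np.quantile([mahalanobis(x) for x in healthy], 0.99)

normal_sample = np.array([0.1, -0.2, 0.0, 0.3])      # hypothetical new measurement
damaged_sample = np.array([3.5, 4.0, -3.0, 5.0])     # indicators drifted by damage
print(mahalanobis(normal_sample) > threshold, mahalanobis(damaged_sample) > threshold)
```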


Author(s):  
Xin Wu ◽  
Yaoyu Li

When an air compressor is operated at a very low flow rate for a given discharge pressure, surge may occur, resulting in large oscillations in pressure and flow in the compressor. To prevent damage to the compressor from surge, the control strategy typically employed is to operate it below the surge line (a map of the conditions at which surge begins). The surge line is strongly affected by ambient air conditions. Previous research has derived data-driven surge maps based on an asymmetric support vector machine (ASVM). The ASVM penalizes the surge case with a much greater cost to minimize the possibility of undetected surge. This paper concerns the development of adaptive ASVM-based self-learning surge map modeling, combined with signal processing techniques for surge detection. During the actual operation of a compressor, after the ASVM-based surge map has been obtained from historic data, new surge points can be identified with surge detection methods such as the short-time Fourier transform or the wavelet transform. The new surge points can then be used to update the surge map. However, with an increasing number of surge points, the complexity of the support vector machine (SVM) would grow dramatically. In order to keep the surge map SVM at a relatively low dimension, an adaptive SVM modeling algorithm is developed to select the minimum set of necessary support vectors in a three-dimensional feature space based on Gaussian curvature, so as to guarantee a desirable classification between surge and non-surge areas. The proposed method is validated using surge test data obtained from a testbed compressor at a manufacturing plant.
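
A hedged sketch of the asymmetric-cost idea follows, implemented here simply as a class-weighted SVM trained on synthetic operating points; the paper's adaptive support-vector pruning based on Gaussian curvature is not reproduced.

```python
# Sketch: penalize misclassified surge points far more than normal points.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# synthetic (mass flow, pressure ratio) operating points; label 1 = surge
no_surge = rng.normal([0.7, 1.5], 0.05, size=(200, 2))
surge = rng.normal([0.3, 1.9], 0.05, size=(40, 2))
X = np.vstack([no_surge, surge])
y = np.array([0] * len(no_surge) + [1] * len(surge))

# asymmetric cost: missing a surge point is much more expensive
clf = SVC(kernel="rbf", C=10.0, class_weight={0: 1.0, 1: 50.0})
clf.fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0])
print(clf.predict([[0.35, 1.85], [0.7, 1.5]]))     # near-surge vs. normal point
```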


2015 ◽  
Vol 24 (04) ◽  
pp. 1540016 ◽  
Author(s):  
Muhammad Hussain ◽  
Sahar Qasem ◽  
George Bebis ◽  
Ghulam Muhammad ◽  
Hatim Aboalsamh ◽  
...  

With the maturing of digital image processing techniques, many tools can forge an image easily without leaving visible traces, which leads to the problem of authenticating digital images. Based on the assumption that forgery alters the texture micro-patterns in a digital image and that texture descriptors can be used to model this change, we employed two state-of-the-art local texture descriptors, the multi-scale Weber's law descriptor (multi-WLD) and the multi-scale local binary pattern (multi-LBP), for splicing and copy-move forgery detection. Since tamper traces are not visible to the naked eye, the chrominance components of an image, which encode these traces, were used to model them with the texture descriptors. To reduce the dimension of the feature space and remove redundant features, we employed a locally learning based (LLB) algorithm. A support vector machine (SVM) was used to identify an image as authentic or tampered. This paper presents a thorough investigation for the validation of this forgery detection method. The experiments were conducted on three benchmark image data sets, namely CASIA v1.0, CASIA v2.0, and Columbia color. The experimental results showed that the accuracy rate of the multi-WLD based method was 94.19% on CASIA v1.0, 96.52% on CASIA v2.0, and 94.17% on the Columbia data set. It is not only significantly better than the multi-LBP based method, but also outperforms other state-of-the-art forgery detection methods.
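
As a rough illustration only, the sketch below computes multi-scale LBP histograms on a chrominance channel and trains an SVM on synthetic labels; the multi-WLD descriptor and the LLB feature selection step are omitted, and the data are random stand-ins.

```python
# Sketch: chrominance-channel multi-scale LBP features -> SVM classifier.
import numpy as np
from skimage.color import rgb2ycbcr
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_features(rgb, radii=(1, 2, 3)):
    cb = rgb2ycbcr(rgb)[:, :, 1].astype(np.uint8)   # chrominance (Cb) channel
    feats = []
    for r in radii:                                 # multi-scale LBP codes
        codes = local_binary_pattern(cb, 8 * r, r, method="uniform")
        hist, _ = np.histogram(codes, bins=8 * r + 2, density=True)
        feats.append(hist)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64, 3))                # stand-ins for real images
labels = np.array([0] * 10 + [1] * 10)              # 0 = authentic, 1 = tampered
X = np.array([lbp_features(im) for im in images])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.score(X, labels))
```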


2020 ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

RNA-Seq data are utilized for biological applications and decision making in the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space, and a fast feature-ranking algorithm is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy against existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
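
A minimal sketch of the reduction-plus-classification part follows, using a plain PCA-to-KNN pipeline on synthetic expression data; the optimized genetic-algorithm and ICA components of the proposed method are not reproduced.

```python
# Sketch: dimensionality reduction followed by KNN classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))                 # stand-in for an expression matrix
y = np.array([0] * 30 + [1] * 30)               # two hypothetical classes

model = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())
```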


2021 ◽  
Vol 50 (9) ◽  
pp. 2579-2589
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi

RNA-Seq data are utilized for biological applications and decision making in the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for extracting relevant information from the data. In this study, a novel optimized dimensionality reduction algorithm is proposed that combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses a Decision Tree on the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space; feature ranking and prior experience are used. The performance of the model is evaluated and validated using classification accuracy against existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.


2020 ◽  
Vol 13 (1) ◽  
pp. 148-151
Author(s):  
Kristóf Muhi ◽  
Zsolt Csaba Johanyák

In most cases, a dataset obtained through observation, measurement, etc. cannot be directly used for the training of a machine learning based system due to the unavoidable presence of missing data, inconsistencies, and a high-dimensional feature space. Additionally, the individual features can have quite different data types and ranges. For this reason, a data preprocessing step is nearly always necessary before the data can be used. This paper gives a short review of the typical methods applicable in the preprocessing and dimensionality reduction of raw data.
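
A short illustrative sketch of such a preprocessing chain, assuming scikit-learn and a toy array with missing entries and differing ranges, is given below.

```python
# Sketch: impute missing values, scale ranges, then reduce dimensionality.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0, np.nan],
              [2.0, np.nan, 0.3],
              [1.5, 180.0, 0.2],
              [3.0, 220.0, 0.9]])               # toy raw data with missing entries

pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     StandardScaler(),
                     PCA(n_components=2))
print(pipe.fit_transform(X))
```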


2021 ◽  
Author(s):  
Antonio Candelieri ◽  
Riccardo Perego ◽  
Francesco Archetti

Searching for accurate machine and deep learning models is a computationally expensive and highly energy-intensive process. A strategy that has recently been gaining importance for drastically reducing computational time and energy consumption is to exploit the availability of different information sources with different computational costs and different "fidelity," typically smaller portions of a large dataset. This multi-source optimization strategy fits into the scheme of Gaussian Process-based Bayesian Optimization. An Augmented Gaussian Process method exploiting multiple information sources (namely, AGP-MISO) is proposed. The Augmented Gaussian Process is trained using only "reliable" information among the available sources, and a novel acquisition function is defined according to the Augmented Gaussian Process. Computational results are reported for the optimization of the hyperparameters of a Support Vector Machine (SVM) classifier using two sources: a large dataset (the most expensive one) and a smaller portion of it. A comparison with a traditional Bayesian Optimization approach that optimizes the hyperparameters of the SVM classifier on the large dataset only is reported.
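
For orientation only, the sketch below runs a plain single-source GP-based Bayesian optimization loop over one SVM hyperparameter (log10 C); it does not implement the AGP-MISO augmentation or the multi-source acquisition function described in the paper.

```python
# Sketch: GP-based Bayesian optimization of an SVM regularization parameter.
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_breast_cancer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(log_c):
    # "expensive" source: cross-validated accuracy on the full dataset
    return cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=3).mean()

grid = np.linspace(-3, 3, 61)
tried = [-2.0, 0.0, 2.0]
scores = [objective(c) for c in tried]

for _ in range(5):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mu, sigma = gp.predict(grid.reshape(-1, 1), return_std=True)
    best = max(scores)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    nxt = float(grid[int(np.argmax(ei))])
    tried.append(nxt)
    scores.append(objective(nxt))

print("best log10(C):", tried[int(np.argmax(scores))], "CV accuracy:", max(scores))
```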


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

RNA-Seq data are utilized for biological applications and decision making in the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space, and a fast feature-ranking algorithm is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy against existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is capable of adding to prevailing machine learning methods.


Author(s):  
Paul Hayton ◽  
Simukai Utete ◽  
Dennis King ◽  
Steve King ◽  
Paul Anuzis ◽  
...  

Novelty detection requires models of normality to be learnt from training data known to be normal. The first model considered in this paper is a static model trained to detect novel events associated with changes in the vibration spectra recorded from a jet engine. We describe how the distribution of energy across the harmonics of a rotating shaft can be learnt by a support vector machine model of normality. The second model is a dynamic model partially learnt from data using an expectation–maximization-based method. This model uses a Kalman filter to fuse performance data in order to characterize normal engine behaviour. Deviations from normal operation are detected using the normalized innovations squared from the Kalman filter.
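
The sketch below illustrates the normalized innovations squared (NIS) check on a one-dimensional synthetic signal with a simple constant-state Kalman filter; the EM-learnt engine model and the performance-data fusion of the paper are not reproduced.

```python
# Sketch: flag novelty when the Kalman filter's NIS exceeds a chi-square bound.
import numpy as np

rng = np.random.default_rng(0)
q, r = 1e-4, 0.05                       # assumed process / measurement noise
x_hat, p = 0.0, 1.0                     # state estimate and its variance
threshold = 3.84                        # chi-square 95% bound, 1 degree of freedom

measurements = list(rng.normal(0.0, np.sqrt(r), 60)) + [2.5, 2.6, 2.7]  # fault at end
for k, z in enumerate(measurements):
    p += q                              # predict (state assumed constant)
    s = p + r                           # innovation covariance
    nu = z - x_hat                      # innovation
    nis = nu ** 2 / s                   # normalized innovations squared
    if nis > threshold:
        print(f"novelty flagged at step {k}: NIS={nis:.1f}")
    k_gain = p / s                      # Kalman gain and measurement update
    x_hat += k_gain * nu
    p *= (1 - k_gain)
```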


Sensors ◽  
2022 ◽  
Vol 22 (1) ◽  
pp. 367
Author(s):  
Janez Lapajne ◽  
Matej Knapič ◽  
Uroš Žibrat

Hyperspectral imaging is a popular tool for non-invasive plant disease detection. Data acquired with it usually consist of many correlated features; hence most of the acquired information is redundant. Dimensionality reduction methods are used to transform the data sets from high-dimensional to low-dimensional (in this study, to one or a few features). We chose six dimensionality reduction methods (partial least squares, linear discriminant analysis, principal component analysis, Random Forest, ReliefF, and extreme gradient boosting) and tested their efficacy on a hyperspectral data set of potato tubers. The extracted or selected features were pipelined to a support vector machine classifier and evaluated. Tubers were divided into two groups, healthy and infested with Meloidogyne luci. The results show that all dimensionality reduction methods enabled successful identification of inoculated tubers. The best and most consistent results were obtained using linear discriminant analysis, with 100% accuracy on both the inside and outside images of potato tubers. Classification success was generally higher in the outside data set than in the inside one. Nevertheless, accuracy was above 0.6 in all cases.
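
As an illustration of the shape of the best-performing pipeline (not the study's code), the sketch below reduces synthetic correlated "bands" with linear discriminant analysis and classifies with an SVM.

```python
# Sketch: LDA reduction of spectral bands -> SVM classification.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
bands = rng.normal(size=(120, 60))          # stand-in for 60 spectral bands
labels = np.array([0] * 60 + [1] * 60)      # 0 = healthy, 1 = infested (synthetic)
bands[labels == 1, 20:30] += 0.8            # inject a spectral difference for class 1

# reduce to a single discriminant feature, then classify with an SVM
model = make_pipeline(LinearDiscriminantAnalysis(n_components=1), SVC(kernel="rbf"))
print(cross_val_score(model, bands, labels, cv=5).mean())
```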

