MODEL AND TRAINING ALGORITHM FOR A MALWARE TRAFFIC DETECTOR BASED ON A MODIFICATION OF GROWING NEURAL GAS

A model of a hierarchical convolutional extractor of malware traffic features is proposed. The model input is an image with a resolution of 28x28 pixels and 10 channels, formed from 10 successive network packet flows, which makes it possible to describe the spatio-temporal statistical characteristics of the traffic. The feature extractor consists of two convolutional layers with three-dimensional filters, sub-sampling layers, and activation layers based on the orthogonal matching pursuit algorithm and the ReLU function. A model of decision rules for the malware traffic detector based on an information-extreme classifier is also proposed. It yields computationally simple decision rules and makes it possible to evaluate the informational efficiency of the feature extractor when the relevant labeled training dataset is limited in size. The classifier performs adaptive feature discretization and constructs radial-basis class containers in binary Hamming space that are optimal in the information sense. The information criterion of learning efficiency is a modification of S. Kullback's measure expressed as a function of the frequencies of errors of the first and second kind. The growing neural gas algorithm for pretraining the feature extractor is improved by modifying the mechanisms for inserting and updating neurons. This makes it possible to use unlabeled training samples and to obtain an optimal distribution of neurons covering the training sample. The modified insertion mechanism creates a new neuron when a threshold is reached rather than at a fixed frequency, which improves the stability of the learning process and makes it possible to regulate the generalization ability of the model. The modified mechanism for updating neuron weights uses Oja's rule instead of Hebb's rule, which avoids uncontrolled growth of the neuron weights and adapts the convolutional filters for sparse coding of input observations. A meta-heuristic simulated annealing search algorithm is proposed for training the decision rules and fine-tuning the high-level filters of the feature extractor. Simulation results on the CTU-Mixed and CTU-13 datasets confirm the effectiveness of the resulting decision rules for recognizing malware traffic in test samples.
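The contrast between Hebb's rule and Oja's rule mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional synthetic data, the learning rate, and the function names (`hebb_update`, `oja_update`) are assumptions chosen only to show why Oja's rule keeps the weight norm bounded while plain Hebbian learning does not.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(w):
    return math.sqrt(dot(w, w))

def hebb_update(w, x, lr):
    # Plain Hebbian step: w <- w + lr * y * x, with y = w . x
    y = dot(w, x)
    return [wi + lr * y * xi for wi, xi in zip(w, x)]

def oja_update(w, x, lr):
    # Oja's step adds a decay term: w <- w + lr * y * (x - y * w)
    y = dot(w, x)
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

# Synthetic inputs (assumption): large variance on axis 0, small on axis 1.
rng = random.Random(0)
samples = [[rng.gauss(0, 1), 0.3 * rng.gauss(0, 1)] for _ in range(2000)]

w_hebb = [0.5, 0.5]
w_oja = [0.5, 0.5]
for x in samples:
    w_hebb = hebb_update(w_hebb, x, lr=0.01)
    w_oja = oja_update(w_oja, x, lr=0.01)

print("Hebb weight norm:", norm(w_hebb))  # keeps growing with more samples
print("Oja weight norm:", norm(w_oja))    # settles close to 1
```

The only difference between the two updates is the `- y * w` decay term, which is what prevents the uncontrolled weight growth that the abstract refers to.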


MULTILAYER MODEL AND TRAINING METHOD FOR MALWARE TRAFFIC DETECTION BASED ON A DECISION TREE ENSEMBLE

RADIOELECTRONIC AND COMPUTER SYSTEMS ◽

10.32620/reks.2020.2.08 ◽

2020 ◽

pp. 92-101

Author(s):

В’ячеслав Васильович Москаленко ◽

Микола Олександрович Зарецький ◽

Альона Сергіївна Москаленко ◽

Антон Михайлович Кудрявцев ◽

Віктор Анатолійович Семашко

Keyword(s):

Sparse Coding ◽

Matching Pursuit ◽

Decision Rules ◽

Training Data ◽

Maximum Efficiency ◽

Explanatory Factors ◽

Feature Extractor ◽

Knowledge Distillation ◽

Traffic Detector ◽

Regularized Method

A model and training method for a multilayer feature extractor and decision rules for a malware traffic detector are proposed. The feature extractor model is based on a convolutional sparse coding network whose sparse encoder is approximated by a regression random forest model according to the principles of knowledge distillation. An algorithm of growing sparse coding neural gas has been developed for unsupervised training of the feature extractor with automatic determination of the required number of features on each layer. To implement sparse coding in the feature extractor, the greedy L1-regularized method of Orthogonal Matching Pursuit was used at the training phase, and the L1-regularized Least Angle Regression algorithm was additionally used at the knowledge distillation phase. Due to the explaining-away effect, the extracted features are uncorrelated and robust to noise and adversarial attacks. The proposed feature extractor is trained without supervision to separate the explanatory factors, which makes it possible to use unlabeled training data, which are usually abundant, with maximum efficiency. The proposed model of the decision rules uses a binary encoder of input observations based on an ensemble of decision trees, together with information-extreme closed hyper-surfaces (containers) for class separation that are reconstructed in the radial basis of binary Hamming space. Coding trees are added according to the boosting principle, and the radii of the class containers are optimized by direct search. The information-extreme classifier is characterized by low computational complexity and high generalization capacity for small sets of labeled training data. Verification of the trained model on open CTU test datasets confirms the suitability of the proposed algorithms for practical application: the accuracy of malware traffic detection is 96.1 %.
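The idea of a radial-basis class container in binary Hamming space, with its radius found by direct search, can be sketched as follows. This is an illustration under stated assumptions, not the method from the paper: the container center is taken as the bitwise majority vote of the class, the search criterion is simply the true positive rate minus the false positive rate (standing in for the information criterion used by the authors), and the six-bit code words are made up.

```python
def hamming(a, b):
    # Number of positions where two binary vectors differ.
    return sum(p != q for p, q in zip(a, b))

def majority_center(vectors):
    # Bitwise majority vote over the class (assumed way to pick the center).
    n = len(vectors)
    return [1 if 2 * sum(col) >= n else 0 for col in zip(*vectors)]

def best_radius(center, positives, negatives):
    # Direct search over all integer radii; the score is TPR - FPR,
    # a simplified stand-in for an information criterion.
    best_r, best_score = 0, float("-inf")
    for r in range(len(center) + 1):
        tpr = sum(hamming(center, v) <= r for v in positives) / len(positives)
        fpr = sum(hamming(center, v) <= r for v in negatives) / len(negatives)
        score = tpr - fpr
        if score > best_score:  # strict comparison keeps the smallest best radius
            best_r, best_score = r, score
    return best_r, best_score

# Hypothetical binary codes produced by some encoder.
malware = [[1, 1, 1, 0, 1, 1], [1, 1, 0, 0, 1, 1], [1, 1, 1, 1, 1, 1]]
benign = [[0, 0, 0, 1, 0, 0], [0, 1, 0, 1, 0, 0], [0, 0, 1, 1, 0, 1]]

center = majority_center(malware)
radius, score = best_radius(center, malware, benign)
print(center, radius, score)
```

Classification then reduces to one Hamming distance and one comparison per class, which is consistent with the low computational complexity the abstract attributes to this kind of classifier.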


A Structure-Adaptive Matching Pursuit Subspace Search Algorithm for Effective Image Sparse Representation

Chinese Journal of Computers ◽

10.3724/sp.j.1016.2012.01751 ◽

2012 ◽

Vol 35 (8) ◽

pp. 1751

Author(s):

Yu-Bao SUN ◽

Liang XIAO ◽

Zhi-Hui WEI ◽

Qing-Shan LIU

Keyword(s):

Sparse Representation ◽

Search Algorithm ◽

Matching Pursuit ◽

Subspace Search


Transformers-sklearn: a toolkit for medical language understanding with transformer-based models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01459-0 ◽

2021 ◽

Vol 21 (S2) ◽

Author(s):

Feihong Yang ◽

Xuwen Wang ◽

Hetong Ma ◽

Jiao Li

Keyword(s):

Language Processing ◽

Pearson Correlation ◽

Fine Tuning ◽

Entity Recognition ◽

Training Dataset ◽

Training Methods ◽

Code Size ◽

Model Framework ◽

Language Understanding ◽

Medical Language

Background: The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To lower the barrier to using transformer-based models in medical language understanding and to extend the capability of the scikit-learn toolkit in deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. Methods: In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is automatically generated by transformers-sklearn from the annotated corpus. Newcomers only need to prepare the dataset. The model framework and training methods are predefined in transformers-sklearn. Results: We collected four open-source medical language datasets, including TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. In the four medical NLP tasks, the average code size of our script is 45 lines/task, which is one-sixth the size of the transformers script. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703 and 0.6908, respectively, on the TrialClassification, BC5CDR and DiabetesNER tasks and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers. Conclusions: The proposed toolkit could help newcomers address medical language understanding tasks easily using the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn.
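Since the results above are reported as macro F1 scores, a short worked example of that metric may help. The sketch below is a plain-Python illustration with made-up labels and a hypothetical function name (`macro_f1`); it does not use or reproduce the transformers-sklearn API, and for NER tasks the score is normally computed over entities rather than individual labels.

```python
def macro_f1(y_true, y_pred):
    # Macro F1: compute F1 separately for each label, then take the unweighted mean.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Made-up gold and predicted labels for three classes.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
print(round(macro_f1(y_true, y_pred), 4))
```

Here the per-class F1 values are 2/3 for "a", 4/5 for "b", and 1 for "c", so every class contributes equally to the average regardless of how many examples it has.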


SARIMA Approach to Generating Synthetic Monthly Rainfall in the Sinú River Watershed in Colombia

Atmosphere ◽

10.3390/atmos11060602 ◽

2020 ◽

Vol 11 (6) ◽

pp. 602

Author(s):

Luisa Martínez-Acosta ◽

Juan Pablo Medrano-Barboza ◽

Álvaro López-Ramos ◽

John Freddy Remolina López ◽

Álvaro Alberto López-Lambraño

Keyword(s):

Time Series ◽

Water Resources ◽

Autocorrelation Function ◽

Information Criterion ◽

Monthly Rainfall ◽

Statistical Characteristics ◽

Rainfall Time Series ◽

River Watershed ◽

Statistical Measures ◽

Sarima Models

Seasonal Autoregressive Integrated Moving Average (SARIMA) models were developed for monthly rainfall time series. Normality of the rainfall time series was achieved by using the Box-Cox transformation. The best SARIMA models were selected based on their autocorrelation function (ACF), partial autocorrelation function (PACF), and the minimum values of the Akaike Information Criterion (AIC). The results of the Ljung–Box statistical test show the randomness and homogeneity of each model's residuals. The performance and validation of the SARIMA models were evaluated using various statistical measures, including Student's t-test. The SARIMA models make it possible to obtain synthetic records that preserve the statistical characteristics of the historical record. Finally, the results obtained can be applied to various hydrological and water resources management studies. This can assist policy and decision-makers in establishing strategies, priorities, and the proper use of water resources in the Sinú river watershed.
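The Box-Cox transformation used to normalize the rainfall series has a simple closed form, sketched below together with its inverse, which is what converts model output back to rainfall units. The rainfall values and the choice of lambda = 0.5 are made up for illustration; the abstract does not state the lambda values or how zero-rainfall months were handled, so this sketch assumes strictly positive inputs.

```python
import math

def box_cox(x, lam):
    # Box-Cox: (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0.
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def inverse_box_cox(y, lam):
    # Inverse transform, used to map synthetic values back to the original scale.
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

# Hypothetical monthly rainfall totals in millimetres.
monthly_rainfall_mm = [12.0, 85.5, 140.2, 230.9, 60.3]
lam = 0.5  # assumed value; in practice it is estimated from the data

transformed = [box_cox(v, lam) for v in monthly_rainfall_mm]
restored = [inverse_box_cox(v, lam) for v in transformed]
print(transformed)
print(restored)
```

A SARIMA model would be fitted to the transformed series, and any synthetic record it generates would be passed through `inverse_box_cox` with the same lambda.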


Covid-19 detection via deep neural network and occlusion sensitivity maps

10.36227/techrxiv.14100890 ◽

2021 ◽

Author(s):

Noor Ahmad ◽

Muhammad Aminu ◽

Mohd Halim Mohd Noor

Keyword(s):

Neural Network ◽

Deep Learning ◽

Deep Neural Network ◽

State Of The Art ◽

Color Images ◽

Fine Tuning ◽

Training Dataset ◽

Learning Approaches ◽

Learning Models ◽

Sensitivity Maps

Deep learning approaches have attracted a lot of attention in the automatic detection of Covid-19, and transfer learning is the most common approach. However, the majority of pre-trained models are trained on color images, which can cause inefficiencies when fine-tuning the models on Covid-19 images, which are often grayscale. To address this issue, we propose a deep learning architecture called CovidNet, which requires a relatively small number of parameters. CovidNet accepts grayscale images as inputs and is suitable for training with a limited training dataset. Experimental results show that CovidNet outperforms other state-of-the-art deep learning models for Covid-19 detection.


AIRBORNE HYPERSPECTRAL REMOTE SENSING FOR IDENTIFICATION GRASSLAND VEGETATION

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprsarchives-xl-3-w3-427-2015 ◽

2015 ◽

Vol XL-3/W3 ◽

pp. 427-431 ◽

Cited By ~ 1

Author(s):

P. Burai ◽

T. Tomor ◽

L. Bekő ◽

B. Deák

Keyword(s):

Image Classification ◽

Learning Algorithm ◽

Training Sample ◽

Hyperspectral Data ◽

Training Dataset ◽

Classification Methods ◽

Grassland Vegetation ◽

Training Samples ◽

Almost All ◽

Noise Fraction

In our study we classified grassland vegetation types of an alkali landscape (Eastern Hungary), using different image classification methods for hyperspectral data. Our aim was to test the applicability of hyperspectral data in this complex system using various image classification methods. To reach the highest classification accuracy, we compared the performance of traditional image classifiers, machine learning algorithms, feature extraction (MNF transformation) and various sizes of training dataset. Hyperspectral images were acquired by an AISA EAGLE II hyperspectral sensor with 128 contiguous bands (400–1000 nm), a spectral sampling of 5 nm bandwidth and a ground pixel size of 1 m. We used twenty vegetation classes, which were compiled based on the characteristic dominant species, canopy height, and total vegetation cover. Image classification was applied to the original and MNF (minimum noise fraction) transformed datasets using various training sample sizes between 10 and 30 pixels. In the case of the original bands, both SVM and RF classifiers provided high accuracy for almost all classes irrespective of the number of training pixels. We found that SVM and RF produced the best accuracy with the first nine MNF-transformed bands. Our results suggest that in complex open landscapes, application of SVM can be a feasible solution, as this method provides higher accuracies compared to RF and MLC. SVM was not sensitive to the size of the training samples, which makes it an adequate tool for cases when the available number of training pixels is limited for some classes.


Fine-tuning Polygenic Risk Scores with GWAS Summary Statistics

10.1101/810713 ◽

2019 ◽

Cited By ~ 4

Author(s):

Zijie Zhao ◽

Yanyao Yi ◽

Yuchang Wu ◽

Xiaoyuan Zhong ◽

Yupei Lin ◽

...

Keyword(s):

Association Studies ◽

Fine Tuning ◽

Risk Scores ◽

Training Dataset ◽

Validation Dataset ◽

P Value ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Polygenic Risk ◽

Model Tuning

Polygenic risk scores (PRSs) have wide applications in human genetics research. Notably, most PRS models include tuning parameters which improve predictive performance when properly selected. However, existing model-tuning methods require individual-level genetic data as the training dataset or as a validation dataset independent from both training and testing samples. These data rarely exist in practice, creating a significant gap between PRS methodology and applications. Here, we introduce PUMAS (Parameter-tuning Using Marginal Association Statistics), a novel method to fine-tune PRS models using summary statistics from genome-wide association studies (GWASs). Through extensive simulations, external validations, and analysis of 65 traits, we demonstrate that PUMAS can perform a variety of model-tuning procedures (e.g. cross-validation) using GWAS summary statistics and can effectively benchmark and optimize PRS models under diverse genetic architecture. On average, PUMAS improves the predictive R² by 205.6% and 62.5% compared to PRSs with arbitrary p-value cutoffs of 0.01 and 1, respectively. Applied to 211 neuroimaging traits and Alzheimer’s disease, we show that fine-tuned PRSs will significantly improve statistical power in downstream association analysis. We believe our method resolves a fundamental problem without a current solution and will greatly benefit genetic prediction applications.


Combining Self-supervised Learning and Active Learning for Disfluency Detection

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3487290 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-25

Author(s):

Shaolei Wang ◽

Zhongyuan Wang ◽

Wanxiang Che ◽

Sendong Zhao ◽

Ting Liu

Keyword(s):

Neural Network ◽

Active Learning ◽

Supervised Learning ◽

Large Scale ◽

Training Data ◽

Fine Tuning ◽

Training Dataset ◽

Performance Gap ◽

Annotation Costs ◽

Trained Neural Network

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, or parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network. The pre-trained neural network is then fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieve better performance when fine-tuning on a small annotated dataset compared to other supervised methods. However, because the pseudo training data are generated based on simple heuristics and cannot fully cover all the disfluency patterns, there is still a performance gap compared to the supervised models trained on the full training dataset. We further explore how to bridge the performance gap by integrating active learning during the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label and can address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
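The construction of pseudo training data by randomly adding or deleting words can be sketched as below. This is only an illustration of the general idea: the probabilities, the filler vocabulary, and the function name `make_pseudo_example` are assumptions, and the paper's actual noise heuristics may differ. Each output token receives a tag for the tagging task (1 for an inserted word, 0 for an original one); deleted words simply disappear from the output.

```python
import random

def make_pseudo_example(tokens, vocab, rng, p_add=0.15, p_del=0.15):
    """Return (noisy_tokens, tags); tag 1 marks an inserted word, 0 an original one."""
    noisy, tags = [], []
    for tok in tokens:
        # Possibly insert a random word before the current token.
        if rng.random() < p_add:
            noisy.append(rng.choice(vocab))
            tags.append(1)
        # Possibly drop the current token.
        if rng.random() < p_del:
            continue
        noisy.append(tok)
        tags.append(0)
    return noisy, tags

# Made-up sentence and filler vocabulary.
tokens = "i want a flight to boston".split()
vocab = ["uh", "well", "the", "flight", "to"]

rng = random.Random(0)  # fixed seed so the example is repeatable
noisy, tags = make_pseudo_example(tokens, vocab, rng)
print(list(zip(noisy, tags)))
```

A label for the sentence classification task could be derived in the same pass, for example by recording whether any insertion or deletion actually occurred.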


Algorithm for Automated Generation of a Training Sample for Solving the Problem of Determining Semantic Similarity between a Pair of Keywords using Machine Learning Methods

PROGRAMMNAYA INGENERIA ◽

10.17587/prin.12.283-294 ◽

2021 ◽

Vol 12 (6) ◽

pp. 283-294

Author(s):

K. V. Lunev

Keyword(s):

Machine Learning ◽

Semantic Similarity ◽

Training Sample ◽

Subject Area ◽

Training Dataset ◽

Training Set ◽

Automated Generation ◽

Machine Learning Methods ◽

Subjective Value ◽

The Subject

Currently, machine learning is an effective approach to solving many problems of information-analytical systems. To use such approaches, a training set of examples is required. Collecting a training dataset is usually a time-consuming process. Its implementation requires the participation of several experts in the subject area for which the training set is collected. Moreover, for some tasks, including the task of determining the semantic similarity of keyword pairs, it is difficult even to correctly draw up instructions for experts to adequately evaluate the test examples. The reason for such difficulties is that semantic similarity is a subjective value and strongly depends on the scope, context, person, and task. The article presents the results of research on the search for models, algorithms and software tools for the automated formation of objects of the training sample in the problem of determining the semantic similarity of a pair of words. In addition, models built on an automated training sample allow us to solve not only the problem of determining semantic similarity, but also an arbitrary problem of classifying edges of a graph. The methods used in this paper are based on graph theory algorithms.


Utilizing Transfer Learning and Homomorphic Encryption in a Privacy Preserving and Secure Biometric Recognition System

Computers ◽

10.3390/computers8010003 ◽

2018 ◽

Vol 8 (1) ◽

pp. 3 ◽

Cited By ~ 10

Author(s):

Milad Salem ◽

Shayan Taheri ◽

Jiann-Shiun Yuan

Keyword(s):

Transfer Learning ◽

Search Algorithm ◽

Homomorphic Encryption ◽

Recognition System ◽

Privacy Preserving ◽

Modern World ◽

Biometric Data ◽

Liveness Detection ◽

Biometric Verification ◽

Feature Extractor

Biometric verification systems have become prevalent in the modern world with the wide usage of smartphones. These systems rely heavily on storing sensitive biometric data in the cloud. Because biometric data such as fingerprints and irises cannot be changed, storing them in the cloud creates a vulnerability and can have catastrophic consequences if these data are leaked. In recent years, in order to preserve the privacy of users, homomorphic encryption has been used to enable computation on encrypted data and to eliminate the need for decryption. This work presents DeepZeroID: a privacy-preserving, cloud-based, multiple-party biometric verification system that uses homomorphic encryption. Via transfer learning, training on sensitive biometric data is eliminated and one pre-trained deep neural network is used as the feature extractor. By developing an exhaustive search algorithm, this feature extractor is applied to the tasks of biometric verification and liveness detection. By eliminating the need for training on and decrypting the sensitive biometric data, this system preserves privacy, requires zero knowledge of the sensitive data distribution, and is highly scalable. Our experimental results show that DeepZeroID can deliver a 95.47% F1 score in the verification of combined iris and fingerprint feature vectors with zero true positives and with 100% accuracy in liveness detection.
