Improved Fake Reviews Detection Model Based on Vertical Ensemble Tri-Training and Active Learning

2021 ◽  
Vol 12 (3) ◽  
pp. 1-19
Author(s):  
Chunyong Yin ◽  
Haoqi Cuan ◽  
Yuhang Zhu ◽  
Zhichao Yin

People’s increasingly frequent online activity has generated a large number of reviews, and fake reviews can mislead users and harm their interests. In addition, labeling reviews at large scale is not feasible because of the high cost of manual annotation. Therefore, to improve detection performance by utilizing unlabeled reviews, this article proposes a fake reviews detection model based on vertical ensemble tri-training and active learning (VETT-AL). The model combines review text features with user behavior features for feature extraction. In the VETT-AL algorithm, the iterative process is divided into two parts: vertical integration within a group and horizontal integration among the groups. The intra-group integration combines the three original classifiers with their models from previous iterations. The inter-group integration adopts entropy-based active learning to select the data with the highest confidence and label it; as a result, the second-generation classifiers are trained by the traditional process to improve labeling accuracy. Experimental results show that the proposed model achieves good classification performance.
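The entropy-based selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation; it assumes the criterion is the Shannon entropy of each unlabeled sample's predicted class distribution, with low entropy taken as high confidence as described above:

```python
import numpy as np

def select_by_entropy(probs, k):
    """Rank unlabeled samples by the Shannon entropy of their predicted
    class distributions; return indices of the k most confident
    (lowest-entropy) samples."""
    probs = np.asarray(probs, dtype=float)
    # Clip avoids log(0) for classes with zero predicted probability.
    ent = -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=1)
    return np.argsort(ent)[:k]

# Toy predicted distributions for three unlabeled reviews:
probs = [[0.90, 0.05, 0.05],   # peaked: most confident
         [0.34, 0.33, 0.33],   # near-uniform: least confident
         [0.60, 0.30, 0.10]]
chosen = select_by_entropy(probs, 1)   # picks index 0
```

Flipping the sort order (`np.argsort(ent)[::-1]`) instead selects the most uncertain samples, the more common active-learning query strategy.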


Author(s):  
G. Matasci ◽  
J. Plante ◽  
K. Kasa ◽  
P. Mousavi ◽  
A. Stewart ◽  
...  

Abstract. We present a deep learning-based vessel detection and (re-)identification approach for spaceborne optical images. We introduce these two components as part of a maritime surveillance from space pipeline and present experimental results on challenging real-world maritime datasets derived from WorldView imagery. First, we developed a vessel detection model based on RetinaNet, achieving an F1-score of 0.795 on a challenging multi-scale dataset. We then collected a large-scale dataset for vessel identification by applying the detection model to 200+ optical images, detecting the vessels therein and assigning them an identity via an Automatic Identification System association framework. A vessel re-identification model based on Twin neural networks was then trained on this dataset, which features 2500+ unique vessels with multiple repeated occurrences across different acquisitions. The model naturally establishes similarities between vessel images: given an input image of a vessel the user is interested in, it returns a relevant ranking of candidate vessels from a database, with top-1 and top-10 accuracies of 38.7% and 76.5%, respectively. This study demonstrates the potential offered by the latest advances in deep learning and computer vision when applied to optical remote sensing imagery in a maritime context, opening new opportunities for automated vessel monitoring and tracking capabilities from space.
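The ranking step of such a re-identification model can be sketched as below. This is a hypothetical minimal example: it assumes the Twin network has already mapped each vessel image to an embedding vector (the abstract does not specify the embedding or the similarity metric; cosine similarity is a common choice):

```python
import numpy as np

def rank_candidates(query_emb, gallery_embs):
    """Rank gallery vessels by cosine similarity to the query embedding,
    best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(sims)[::-1]

# Toy 2-D embeddings: gallery vessel 2 points almost the same way as the query.
query = np.array([1.0, 0.0])
gallery = np.array([[0.00, 1.00],
                    [0.70, 0.70],
                    [0.99, 0.10]])
order = rank_candidates(query, gallery)   # index 2 ranks first
```

Top-k accuracy is then simply the fraction of queries whose true identity appears among the first k indices of `order`.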



2019 ◽  
Vol 13 (9) ◽  
pp. 1401-1409 ◽  
Author(s):  
Xu Li ◽  
Lin Hong ◽  
Jian-chun Wang ◽  
Xiang Liu


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-17
Author(s):  
Fuguang Bao ◽  
Yongqiang Wu ◽  
Zhaogang Li ◽  
Yongzhao Li ◽  
Lili Liu ◽  
...  

Anomaly detection over high-dimensional, unbalanced data is common, and effective anomaly detection is essential for early warning of problems or disasters and for maintaining system reliability. Detecting anomalies is a significant research issue in sensor data analysis, and it is essentially an unbalanced-sequence binary classification problem: the data is large in scale, computationally expensive to process, unbalanced in its class distribution, and sequentially dependent. This paper combines long short-term memory networks (LSTMs) with historical sequence data, integrates the synthetic minority oversampling technique (SMOTE) with K-nearest neighbors (kNN), and designs an anomaly detection model, kNN-SMOTE-LSTM, suited to these unbalanced data characteristics. Through its kNN discriminant classifier, the model continuously filters the synthetic samples generated by SMOTE so that only reliable ones are kept, avoiding the blindness and limitations of the SMOTE algorithm in generating new samples and improving model performance. The experiments demonstrate that the kNN-SMOTE-LSTM model significantly improves performance on unbalanced-sequence binary classification.
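The SMOTE-plus-kNN-filtering idea can be sketched as follows. This is a minimal illustration of the sampling step only, not the paper's full kNN-SMOTE-LSTM network (the LSTM is omitted entirely), and the acceptance rule, a majority of minority-class labels among the k nearest real neighbors, is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_labels(x, X, y, k=3):
    """Labels of the k nearest training points to x (Euclidean distance)."""
    d = np.linalg.norm(X - x, axis=1)
    return y[np.argsort(d)[:k]]

def smote_knn_filter(X, y, n_new=10, k=3):
    """Generate SMOTE-style minority samples, keeping only those whose
    k nearest real neighbors are mostly minority (label 1)."""
    minority = X[y == 1]
    kept = []
    while len(kept) < n_new:
        a, b = minority[rng.choice(len(minority), 2, replace=False)]
        synth = a + rng.random() * (b - a)        # interpolate between two minority points
        if (knn_labels(synth, X, y, k) == 1).mean() > 0.5:
            kept.append(synth)                    # safe: lies inside the minority region
    return np.array(kept)

# Toy unbalanced set: majority cluster near (0, 0), minority cluster near (5, 5).
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(5, 0.5, (6, 2))])
y = np.array([0] * 40 + [1] * 6)
new = smote_knn_filter(X, y, n_new=5)
```

The kNN check is what distinguishes this from plain SMOTE: candidates that fall between clusters, where their nearest neighbors are majority-class, are discarded rather than added blindly.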



2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Yong Fang ◽  
Mingyu Xie ◽  
Cheng Huang

Application security is essential in today’s period of rapid development. A backdoor is a means by which attackers invade a system to achieve illegal ends and damage users’ rights, and it poses a serious threat to network security; it is therefore urgent to take adequate measures to defend against such attacks. Previous research has focused mainly on PHP webshells, with less attention to Python backdoor files, and language differences make those methods not directly applicable. This paper proposes a Python backdoor detection model named PBDT based on combined features. The model summarizes the functional modules and functions commonly found in backdoor files and counts their calls in the text to form sample features. Moreover, we consider the text’s statistical characteristics, including the information entropy and the longest string, to identify obfuscated Python code. In addition, the opcode sequence, represented as a TF-IDF vector and classified with FastText, is used to capture code characteristics and eliminate the influence of interference items. Finally, we introduce the Random Forest algorithm to build a classifier. On samples covering most types of backdoors, some of them obfuscated, the model achieves an accuracy of 97.70% and a TNR as high as 98.66%, showing good classification performance in Python backdoor detection.
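The hand-crafted part of such a feature set can be sketched as below. The suspicious-call list and the exact features are illustrative assumptions, not the paper's actual feature definitions; the sketch only mirrors the idea of combining call counts with statistical traits (entropy, longest string) that flag obfuscation:

```python
import math
import re
from collections import Counter

# Hypothetical watchlist of calls common in Python backdoors.
SUSPICIOUS = ("eval", "exec", "base64.b64decode", "os.popen", "subprocess")

def extract_features(source: str) -> dict:
    """Combine suspicious-call counts with statistical traits of the text."""
    calls = sum(source.count(name) for name in SUSPICIOUS)
    counts = Counter(source)
    n = len(source)
    # Character-level Shannon entropy: obfuscated/encoded code scores high.
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    # Longest base64-ish token: encoded payloads produce long runs.
    longest = max((len(t) for t in re.findall(r"[A-Za-z0-9+/=]+", source)),
                  default=0)
    return {"suspicious_calls": calls,
            "entropy": round(entropy, 2),
            "longest_token": longest}

benign = "def add(a, b):\n    return a + b\n"
backdoor = "import base64\nexec(base64.b64decode('cHJpbnQoMSk='))\n"
f_b = extract_features(backdoor)   # high call count and long encoded token
```

These per-sample dictionaries would then be vectorized, concatenated with the opcode-sequence features, and fed to the Random Forest classifier.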



2014 ◽  
Vol 19 (5) ◽  
pp. 685-695 ◽  
Author(s):  
Kevin Smith ◽  
Peter Horvath

High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost, namely, the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data. In this article, we investigate the impact of active learning on single-cell–based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We consider several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost and can be used to reveal the same phenotypic targets identified using SML. We also identify combinations of active learning strategies and SML methods which perform better than others on the phenotypic profiling problems we studied.



Author(s):  
Yousra Hamrouni ◽  
Éric Paillassa ◽  
Véronique Chéret ◽  
Claude Monteil ◽  
David Sheeren

Reliable estimates of poplar plantation area are not available at the French national scale due to the unsuitability and low update rate of existing forest databases for this short-rotation species. While supervised classification methods have been shown to be highly accurate in mapping forest cover from remotely sensed images, their performance depends to a great extent on the labelled samples used to build the models. In addition to their high acquisition cost, such samples are often scarce and not fully representative of the variability in class distributions. Consequently, when classification models are applied to large areas with high intra-class variance, they generally yield poor accuracies. In this paper, we propose the use of active learning (AL) to efficiently adapt a classifier trained on a source image to spatially distinct target images with minimal labelling effort and without sacrificing classification performance. The adaptation consists of actively adding, to the initial local model, new relevant training samples from other areas, in a cascade that iteratively improves the generalisation capabilities of the classifier, leading to a global model tailored to different areas. This active selection relies on uncertainty sampling to focus directly on the most informative pixels, those for which the algorithm is least certain of the class labels. Experiments conducted on Sentinel-2 time series showed that when the same number of training samples was used, active learning outperformed passive learning (random sampling) by up to 5% in overall accuracy and up to 12% in class F-score. In addition, and depending on the class considered, random sampling required up to 50% more samples to achieve the same performance as an active learning-based model. Moreover, the results demonstrate the suitability of the derived global model to accurately map poplar plantations among other tree species, with overall accuracy values up to 14% higher than those obtained with local models. The proposed approach paves the way for national-scale mapping in an operational context.
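The uncertainty-sampling query can be sketched as below. This minimal example uses the least-confidence variant (lowest top-class probability), one common form of the uncertainty sampling the authors describe; the abstract does not state which variant they use:

```python
import numpy as np

def least_confidence_query(probs, budget):
    """Select the `budget` pixels whose highest predicted class
    probability is lowest, i.e. where the classifier is least certain."""
    conf = np.max(np.asarray(probs, dtype=float), axis=1)
    return np.argsort(conf)[:budget]

# Toy per-pixel class probabilities from the current classifier:
probs = np.array([[0.98, 0.01, 0.01],   # confident
                  [0.40, 0.35, 0.25],   # uncertain: best query candidate
                  [0.55, 0.30, 0.15]])
picked = least_confidence_query(probs, 1)   # selects pixel 1
```

In the cascade described above, the selected pixels from each new target image would be labelled and appended to the training set before retraining.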



2020 ◽  
Vol 34 (04) ◽  
pp. 6583-6590
Author(s):  
Yi-Fan Yan ◽  
Sheng-Jun Huang ◽  
Shaoyi Chen ◽  
Meng Liao ◽  
Jin Xu

Labeling a text document is usually time-consuming because it requires the annotator to read the whole document and check its relevance against each possible class label. It thus becomes rather expensive to train an effective model for text classification when a large dataset of long documents is involved. In this paper, we propose an active learning approach for text classification with lower annotation cost. Instead of scanning all the examples in the unlabeled data pool to select the best one for query, the proposed method automatically generates the most informative examples based on the classification model, and thus can be applied to tasks with large-scale or even infinite unlabeled data. Furthermore, we propose to approximate the generated example with a few summary words by sparse reconstruction, which allows the annotators to easily assign the class label by reading a few words rather than the long document. Experiments on different datasets demonstrate that the proposed approach can effectively improve classification performance while significantly reducing the annotation cost.
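The summary-word step can be caricatured as below. This is a deliberately simplified stand-in for the paper's sparse-reconstruction formulation: with a one-word-per-basis dictionary, the sparse approximation of a generated example reduces to keeping its few highest-weight terms. The vocabulary and vector are invented for illustration:

```python
import numpy as np

# Hypothetical vocabulary over which generated examples are represented.
VOCAB = ["market", "stock", "goal", "match", "league", "shares", "trade", "coach"]

def summary_words(doc_vec, n_words=3):
    """Approximate a generated example by its n highest-weight terms,
    the words an annotator would read instead of a full document."""
    idx = np.argsort(np.abs(np.asarray(doc_vec, dtype=float)))[::-1][:n_words]
    return [VOCAB[i] for i in idx]

# Generated 'sports' example: most mass on the goal/match/league dimensions.
vec = [0.05, 0.02, 0.80, 0.61, 0.55, 0.01, 0.03, 0.30]
words = summary_words(vec)
```

An annotator shown only `words` can plausibly assign the label "sports" without reading any document, which is the cost saving the abstract claims.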


