Threat Hunting as a Similarity Search Problem on Multi-positive and Unlabeled Data

Many new problems are emerging in the mining of massive datasets, among them the problem of finding similar documents. The shingling method converts this problem into a set-based one. Some existing methods use min-hashing to compress the output of shingling and then apply locality-sensitive hashing (LSH) to select candidate pairs for similarity search from all pairs of documents. This paper proposes an Apriori-based method that finds similar documents using a frequent itemset mining approach. To this end, the Apriori algorithm is modified and customized for the similarity search problem. The main advantages of the proposed method are that it models similarity search as a frequent pattern mining problem, uses a modified version of Apriori, and selects the minimum support threshold dynamically; together these lead to a reasonable execution time and high-quality results. The proposed method finds similar documents in less time than the combined method and the MCVM method because it generates fewer candidate pairs. Furthermore, experimental results show the high quality of the answers produced by the proposed method.
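
The pipeline this abstract describes (shingling, candidate-pair generation, then verification) can be illustrated with a small sketch. The abstract does not spell out the modified Apriori algorithm, so the code below is only one possible reading, not the paper's method: each shingle is treated as a basket of the documents that contain it, and a document pair becomes a candidate when it co-occurs in at least `min_support` baskets (a frequent 2-itemset). The function names, the fixed `min_support`, and the Jaccard verification step are assumptions made for illustration; the paper selects its support threshold dynamically.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pairs(shingle_sets, min_support):
    """Keep document pairs that share at least min_support shingles.

    Assumption: each shingle acts as a 'basket' of document ids, so a pair's
    support is the number of baskets in which both documents appear.
    """
    baskets = defaultdict(set)
    for doc_id, sset in shingle_sets.items():
        for s in sset:
            baskets[s].add(doc_id)

    support = defaultdict(int)
    for docs in baskets.values():
        for pair in combinations(sorted(docs), 2):
            support[pair] += 1
    return {pair for pair, count in support.items() if count >= min_support}


def similar_documents(docs, k=5, min_support=3, threshold=0.5):
    """Return {(doc_a, doc_b): similarity} for verified candidate pairs."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    result = {}
    for a, b in candidate_pairs(sets, min_support):
        sim = jaccard(sets[a], sets[b])
        if sim >= threshold:
            result[(a, b)] = sim
    return result
```

Only pairs that pass the support filter are scored with Jaccard similarity, which reflects the abstract's point that generating fewer candidate pairs reduces the work needed to find similar documents.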

An Incremental Prefix Filtering Approach for the All Pairs Similarity Search Problem

2010 12th International Asia-Pacific Web Conference ◽ 10.1109/apweb.2010.30 ◽ 2010 ◽ Cited By ~ 2

Author(s): Hoang Thanh Lam ◽ Dinh Viet Dung ◽ Raffaele Perego ◽ Fabrizio Silvestri

Keyword(s): Similarity Search ◽ Search Problem ◽ Filtering Approach

Similarity Search Problem Research on Multi-dimensional Data Sets

2013 10th International Conference on Information Technology: New Generations ◽ 10.1109/itng.2013.72 ◽ 2013

Author(s): Yong Shi ◽ Brian Graham

Keyword(s): Similarity Search ◽ Search Problem ◽ Data Sets

Efficient Subsequence Search on Streaming Data Based on Time Warping Distance

ECTI Transactions on Computer and Information Technology (ECTI-CIT) ◽ 10.37936/ecti-cit.201151.54225 ◽ 2011 ◽ Vol 5 (1) ◽ pp. 2-8 ◽ Cited By ~ 1

Author(s): Sura Rodpongpun ◽ Vit Niennattrakul ◽ Chotirat Ann Ratanamahatana

Keyword(s): Time Series ◽ Similarity Search ◽ Time Series Data ◽ Distance Measure ◽ Computational Cost ◽ Streaming Data ◽ Series Data ◽ Search Problem ◽ Time Warping ◽ Subsequence Matching

Many algorithms have been proposed for the subsequence similarity search problem on time series data streams. Dynamic Time Warping (DTW), widely accepted as the best distance measure for time series similarity search, has been used in many research works. SPRING and its variants were proposed to solve this problem by mitigating the complexity of DTW. Unfortunately, these algorithms produce meaningless results because no normalization is applied before the distance calculation. Recently, GPUs and FPGAs have been used for similarity search with subsequence normalization to reduce the computational cost, but this approach is still far from practical use. In this work, we propose a novel Meaningful Subsequence Matching (MSM) algorithm, which produces meaningful results in subsequence matching by considering a global constraint, uniform scaling, and normalization. Our method significantly outperforms the existing algorithms in terms of both computational cost and accuracy.
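
To make the role of normalization and the global constraint concrete, the following is a minimal sketch of DTW on z-normalized sequences with a Sakoe-Chiba band. It is not the MSM algorithm from the paper: it compares two whole sequences rather than searching a stream, and it omits uniform scaling. The function names and the `window` parameter are assumptions chosen for illustration.

```python
import math


def z_normalize(seq):
    """Shift and scale a sequence to zero mean and unit standard deviation."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n  # constant sequence: nothing to scale
    return [(x - mean) / std for x in seq]


def dtw_distance(a, b, window=None):
    """DTW distance with an optional Sakoe-Chiba band of half-width `window`."""
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)  # no constraint
    window = max(window, abs(n - m))  # the band must reach the final cell
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],       # step in a only
                                 cost[i][j - 1],       # step in b only
                                 cost[i - 1][j - 1])   # step in both
    return math.sqrt(cost[n][m])


def normalized_dtw(a, b, window=None):
    """DTW distance after z-normalizing both sequences."""
    return dtw_distance(z_normalize(a), z_normalize(b), window)
```

Without normalization, two sequences with the same shape but different offset or amplitude look far apart; after z-normalization their distance is close to zero. This is the effect the abstract refers to when it says that skipping normalization produces meaningless results.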