Cross-platform binary code similarity detection based on NMT and graph embedding

2021 ◽  
Vol 18 (4) ◽  
pp. 4528-4551
Author(s):  
Xiaodong Zhu ◽  
Liehui Jiang ◽  
Zeng Chen ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Shen Wang ◽  
Xunzhi Jiang ◽  
Xiangzhan Yu ◽  
Xiaohui Su

Binary code homology analysis detects whether two pieces of binary code were compiled from the same source code. It is a fundamental technique for many security applications, such as vulnerability search, plagiarism detection, and malware detection. With the increase in critical vulnerabilities in IoT devices, homology analysis is increasingly needed to perform cross-platform vulnerability searches. Existing methods for cross-platform binary code homology detection usually convert binary code to instruction sequences and perform semantic embedding of the sequences as if they were natural language. However, the gap between natural language and binary code is large, and the spatial features of the binary code are easily lost when only the semantics are compared. In this paper, we propose a GRU-based graph embedding method to compare the homology of binary functions. First, the attribute control flow graph (ACFG) is built for the assembly function; then the GRU-based graph embedding neural network is used to generate the embedding vector for the ACFG; finally, the homology of the binary code is determined by calculating the distance between the embedding vectors. The experimental results show that our method greatly improves the detection accuracy on negative samples compared with Gemini, the latest graph-embedding-based method for binary code similarity detection.
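The last step described in the abstract above, deciding homology from the distance between two embedding vectors, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the distance measure, so cosine similarity is an assumption here, and the `cosine_similarity` and `is_homologous` helpers, the example vectors, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def is_homologous(u, v, threshold=0.8):
    """Hypothetical decision rule: similar enough means same source function."""
    return cosine_similarity(u, v) >= threshold


# Made-up embeddings standing in for the output of a graph embedding network.
emb_x86 = [0.9, 0.1, 0.4]     # function compiled for one platform
emb_arm = [0.8, 0.2, 0.5]     # same function compiled for another platform
emb_other = [-0.7, 0.6, -0.1]  # unrelated function

print(is_homologous(emb_x86, emb_arm))
print(is_homologous(emb_x86, emb_other))
```

In practice the threshold would be tuned on labelled function pairs rather than fixed in advance.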


2020 ◽  
Vol 34 (01) ◽  
pp. 1145-1152 ◽  
Author(s):  
Zeping Yu ◽  
Rui Cao ◽  
Qiyi Tang ◽  
Sen Nie ◽  
Junzhou Huang ◽  
...  

Binary code similarity detection, whose goal is to detect similar binary functions without having access to the source code, is an essential task in computer security. Traditional methods usually use graph matching algorithms, which are slow and inaccurate. Recently, neural network-based approaches have made great achievements. A binary function is first represented as a control-flow graph (CFG) with manually selected block features, and then a graph neural network (GNN) is adopted to compute the graph embedding. While these methods are effective and efficient, they cannot capture enough semantic information of the binary code. In this paper we propose semantic-aware neural networks to extract the semantic information of the binary code. Specifically, we use BERT to pre-train the binary code on one token-level task, one block-level task, and two graph-level tasks. Moreover, we find that the order of the CFG's nodes is important for graph similarity detection, so we adopt a convolutional neural network (CNN) on adjacency matrices to extract the order information. We conduct experiments on two tasks with four datasets. The results demonstrate that our method outperforms the state-of-the-art models.
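The point about node order in the abstract above can be illustrated with a small sketch: the same CFG written down under two node orderings yields different adjacency matrices, so a CNN applied to the matrix receives different inputs. The toy graph, the block names, and the `adjacency_matrix` helper are hypothetical and are not taken from the paper.

```python
def adjacency_matrix(edges, order):
    """Dense 0/1 adjacency matrix of a directed graph.

    Rows and columns are arranged according to `order`.
    """
    index = {node: i for i, node in enumerate(order)}
    size = len(order)
    matrix = [[0] * size for _ in range(size)]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix


# Toy CFG of a loop: entry -> check, check -> body or exit, body -> check.
cfg_edges = [
    ("entry", "check"),
    ("check", "body"),
    ("check", "exit"),
    ("body", "check"),
]

# Same graph, two different node orderings.
m_ordered = adjacency_matrix(cfg_edges, ["entry", "check", "body", "exit"])
m_shuffled = adjacency_matrix(cfg_edges, ["exit", "body", "entry", "check"])

for row in m_ordered:
    print(row)
print(m_ordered == m_shuffled)
```

Both matrices describe the same four edges, but their layouts differ, which is why a consistent node ordering carries information that a matrix-based model can exploit.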


Author(s):  
Zhengping Luo ◽  
Tao Hou ◽  
Xiangrong Zhou ◽  
Hui Zeng ◽  
Zhuo Lu

IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 120501-120512
Author(s):  
Hui Guo ◽  
Shuguang Huang ◽  
Cheng Huang ◽  
Min Zhang ◽  
Zulie Pan ◽  
...  

2021 ◽  
Vol 2021 ◽  
pp. 1-19
Author(s):  
Yan Wang ◽  
Peng Jia ◽  
Cheng Huang ◽  
Jiayong Liu ◽  
Peisong He

Binary code similarity comparison is the technique that determines whether two functions are similar by considering only their compiled form. It has many applications, including clone detection, malware classification, and vulnerability discovery. However, it is challenging to design a robust code similarity comparison engine, since different compilation settings make logically similar assembly functions appear very different. Moreover, existing approaches suffer from high performance overheads, low robustness, or poor scalability. In this paper, a novel solution, HBinSim, is proposed that employs multiview features of the function to address these challenges. It first extracts the syntactic and semantic features of each basic block by static analysis. HBinSim further analyzes the function and constructs a syntactic attribute control flow graph and a semantic attribute control flow graph for each function. Then, a hierarchical attention graph embedding network is designed for graph-structured data processing. The network model has a hierarchical structure that mirrors the hierarchical structure of the function. It has three levels of attention mechanisms applied at the instruction, basic block, and function level, enabling it to attend differentially to more and less critical content when constructing the function representation. We conduct extensive experiments to evaluate its effectiveness and efficiency. The results show that our tool outperforms the state-of-the-art binary code similarity comparison tools by a large margin in clone search across diverse compilation settings. A real-world vulnerability search case further demonstrates the usefulness of our system.
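The idea of attending differently to more and less important content at successive levels can be sketched with plain attention pooling. This is a simplified illustration under stated assumptions, not HBinSim's network: it shows only two pooling steps (instructions into a block vector, blocks into a function vector), uses fixed context vectors where a real model would learn them, and every name and value in it is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weighted sum of `vectors`; weights come from dot products with `context`."""
    scores = [sum(a * b for a, b in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, vectors))
        for i in range(len(context))
    ]
    return pooled, weights


def embed_function(blocks, instr_context, block_context):
    """Pool instruction vectors per block, then pool block vectors per function."""
    block_vectors = [attention_pool(instrs, instr_context)[0] for instrs in blocks]
    return attention_pool(block_vectors, block_context)


# Two basic blocks holding made-up 2-dimensional instruction vectors.
blocks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
]
function_vector, block_weights = embed_function(
    blocks, instr_context=[1.0, 0.0], block_context=[0.0, 1.0]
)
print(function_vector)
print(block_weights)
```

The returned weights show how much each block contributed to the function vector, which is the kind of differential attention the abstract describes.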


2021 ◽  
Vol 168 ◽  
pp. 114348
Author(s):  
Donghai Tian ◽  
Xiaoqi Jia ◽  
Rui Ma ◽  
Shuke Liu ◽  
Wenjing Liu ◽  
...  
