Large-scale and Robust Code Authorship Identification with Deep Feature Learning

2021 ◽  
Vol 24 (4) ◽  
pp. 1-35
Author(s):  
Mohammed Abuhamad ◽  
Tamer Abuhmed ◽  
David Mohaisen ◽  
Daehun Nyang

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is very challenging, because software comes in various formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples, both in the number of samples per author and in the size of individual samples. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach learns deep authorship representations with a recurrent neural network and uses an ensemble random forest classifier to de-anonymize programmers at scale. Comprehensive experiments evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results show high accuracy despite requiring a small number of samples per author. Experimenting with source code, our approach identifies 8,903 GCJ authors, the largest-scale dataset used to date, with an accuracy of 92.3%. On the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics: it identifies authors across four programming languages (C, C++, Java, and Python) and authors writing in mixed languages (e.g., Java/C++, Python/C++). Our system also resists sophisticated obfuscation (e.g., with the Tigress C obfuscator), with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves 95.74% accuracy in identifying 1,500 programmers of software binaries. Similar results hold when binaries are generated with different compilation options and optimization levels, and when symbol information is stripped. Moreover, our approach achieves 93.86% accuracy in identifying 1,500 programmers of binaries obfuscated with all features of the Obfuscator-LLVM tool.
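The paper's pipeline (an RNN learning deep representations, feeding a random forest classifier) needs a deep learning framework; as a rough, dependency-free illustration of the general idea of profile-based authorship attribution, the following sketch substitutes character-trigram profiles with a nearest-centroid match. The toy corpus and all names are hypothetical, not the authors' method or data.

```python
from collections import Counter
import math

def trigram_profile(code):
    """Character-trigram frequency profile of a code sample."""
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(samples_by_author):
    """Build one aggregate trigram profile (centroid) per author."""
    centroids = {}
    for author, samples in samples_by_author.items():
        profile = Counter()
        for s in samples:
            profile.update(trigram_profile(s))
        centroids[author] = profile
    return centroids

def attribute(centroids, code):
    """Return the author whose profile is most similar to the sample."""
    query = trigram_profile(code)
    return max(centroids, key=lambda a: cosine(centroids[a], query))

# Toy corpus: two authors with very different coding styles.
corpus = {
    "alice": ["def add(x, y):\n    return x + y\n",
              "def mul(x, y):\n    return x * y\n"],
    "bob":   ["int add(int x,int y){return x+y;}",
              "int mul(int x,int y){return x*y;}"],
}
model = train(corpus)
print(attribute(model, "def sub(x, y):\n    return x - y\n"))  # → alice
```

The unseen snippet is attributed to "alice" because its trigrams (indentation, `def `, spacing around operators) overlap far more with her profile; the real system replaces such hand-picked surface features with learned deep representations.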

2018 ◽  
Vol 74 (12) ◽  
pp. 6753-6765 ◽  
Author(s):  
Seon Ho Oh ◽  
Seung-Wan Han ◽  
Bum-Suk Choi ◽  
Geon-Woo Kim ◽  
Kyung-Soo Lim

Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 2044 ◽  
Author(s):  
Anna Kurtukova ◽  
Aleksandr Romanov ◽  
Alexander Shelupanov

Many open-source projects are developed by the community and share a common basis. The more source code is open, the more the project is open to contributors. The possibility of accidental or deliberate use of someone else's source code as closed functionality in another project (even a commercial one) cannot be excluded, and this situation can create copyright disputes. Adding a plagiarism check to the project lifecycle during software engineering addresses this problem. However, not all code samples needed for comparison can be found in the public domain; in such cases, methods for identifying the source code author can be useful. Identifying the source code author is therefore an important problem in software engineering, and it is also a research area in symmetry. This article discusses the problem of identifying the source code author and modern methods of solving it. Drawing on the experience of researchers in the field of natural language processing (NLP), the authors propose a technique based on a hybrid neural network and demonstrate its results both for simple cases of determining code authorship and for cases complicated by obfuscation and the use of coding standards. The results show that the proposed technique successfully addresses the key shortcomings of analogous methods and can be effective even when there are no obvious signs indicating authorship. The average accuracy across all programming languages was 95% in the simple case and exceeded 80% in the complicated ones.


2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Rainer Schnell ◽  
Jonas Klingwort ◽  
James M. Farrow

Abstract Background We introduce and study a recently proposed method for privacy-preserving distance computations that has so far received little attention in the scientific literature. The method, which is based on intersecting sets of randomly labeled grid points and is henceforth denoted ISGP, allows calculating approximate distances between masked spatial data. Coordinates are replaced by sets of hash values. The method allows the computation of distances between locations L when the locations at different points in time t are not known simultaneously. The distance between L1 and L2 can be computed even when L2 does not exist at t1 and L1 has been deleted at t2. An example would be patients from a medical data set and locations of later hospitalizations. ISGP is a new tool for privacy-preserving handling of geo-referenced data sets in general. Furthermore, this technique can be used to include geographical identifiers as additional information for privacy-preserving record linkage. To show that the technique can be implemented in most high-level programming languages with a few lines of code, a complete implementation within the statistical programming language R is given. The properties of the method are explored using simulations based on large-scale real-world data of hospitals (n = 850) and residential locations (n = 13,000). The method has already been used in a real-world application. Results ISGP yields very accurate results. Our simulation study showed that, with appropriately chosen parameters, 99% accuracy in the approximated distances is achieved. Conclusion We discussed a new method for privacy-preserving distance computations in microdata. The method is highly accurate, fast, has low computational burden, and does not require excessive storage.
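The abstract's reference implementation is in R; the following is a hedged Python sketch of the ISGP idea as described above: each coordinate is masked as the set of salted hashes of grid points within a fixed radius, and the approximate distance is recovered by inverting the circle-intersection (lens) area formula on the overlap of the two sets. The radius, grid step, and function names are illustrative assumptions, not the authors' parameter choices.

```python
import hashlib
import math

def mask(x, y, radius, salt):
    """Mask a coordinate as the set of hashed unit-grid points within `radius`."""
    cells = set()
    r2 = radius * radius
    for i in range(int(x - radius), int(x + radius) + 1):
        for j in range(int(y - radius), int(y + radius) + 1):
            if (i - x) ** 2 + (j - y) ** 2 <= r2:
                cells.add(hashlib.sha256(f"{salt}:{i}:{j}".encode()).hexdigest())
    return cells

def lens_area(d, r):
    """Area of intersection of two radius-r disks whose centers are d apart."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - (d / 2) * math.sqrt(4 * r * r - d * d)

def approx_distance(set_a, set_b, radius):
    """Invert the lens-area formula to recover the distance from the overlap."""
    avg = (len(set_a) + len(set_b)) / 2
    target = len(set_a & set_b) / avg * math.pi * radius * radius
    lo, hi = 0.0, 2.0 * radius
    for _ in range(60):  # bisection: lens_area is decreasing in d
        mid = (lo + hi) / 2
        if lens_area(mid, radius) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

salt = "shared-secret"
a = mask(0, 0, 100, salt)
b = mask(30, 0, 100, salt)
print(round(approx_distance(a, b, 100), 1))  # ≈ 30
```

Only parties holding the shared salt can produce comparable sets, and the true coordinates never leave the data holder; the distance is approximated purely from the size of the intersection of hash sets.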


2021 ◽  
Vol 54 (6) ◽  
pp. 1-37
Author(s):  
Shilin He ◽  
Pinjia He ◽  
Zhuangbin Chen ◽  
Tianyi Yang ◽  
Yuxin Su ◽  
...  

Logs are semi-structured text generated by logging statements in software source code. In recent decades, software logs have become imperative to the reliability assurance mechanisms of many software systems, because they are often the only available data that record software runtime information. As modern software has grown in scale, the volume of logs has increased rapidly. To enable effective and efficient use of modern software logs in reliability engineering, a number of studies have been conducted on automated log analysis. This survey presents a detailed overview of automated log analysis research, including how to automate and assist the writing of logging statements, how to compress logs, how to parse logs into structured event templates, and how to employ logs to detect anomalies, predict failures, and facilitate diagnosis. Additionally, we survey work that releases open-source toolkits and datasets. Based on the discussion of recent advances, we present several promising future directions toward real-world and next-generation automated log analysis.
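One surveyed task, parsing raw logs into structured event templates, can be illustrated with a minimal sketch that masks variable tokens with regular expressions. This is a toy stand-in for the idea, not one of the surveyed tools, and the masking patterns are illustrative assumptions.

```python
import re
from collections import defaultdict

# Order matters: mask the most specific patterns first.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(line):
    """Reduce a raw log line to its event template by masking variable tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def parse(lines):
    """Group raw log lines by their extracted event template."""
    groups = defaultdict(list)
    for line in lines:
        groups[to_template(line)].append(line)
    return groups

logs = [
    "Connected to 10.0.0.1 in 23 ms",
    "Connected to 10.0.0.2 in 145 ms",
    "Worker 7 exited with code 1",
]
groups = parse(logs)
print(len(groups))  # → 2
```

The first two lines collapse into the single template "Connected to <IP> in <NUM> ms"; production log parsers tackle the much harder case where the constant and variable parts are not known in advance.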


2021 ◽  
Vol 2134 (1) ◽  
pp. 012011
Author(s):  
Alina Bogdanova ◽  
Vitaly Romanov

Abstract Source code authorship attribution is a problem that has lately been studied more often, owing to improvements in deep learning techniques. Among existing solutions, two common issues are the inability to add new authors without retraining and a lack of interpretability. We address both of these problems. In our experiments, we were able to correctly classify 75% of authors across different programming languages. Additionally, we applied techniques of explainable AI (XAI) and found that our model appears to pay attention to distinctive features of source code.


Author(s):  
Jie Wan ◽  
Jinfu Liu ◽  
Guorui Ren ◽  
Yufeng Guo ◽  
Daren Yu ◽  
...  

Day-ahead prediction of wind speed is a basic and key problem of large-scale wind power penetration. Many current techniques fail to satisfy practical engineering requirements because wind speed has strong nonlinear features influenced by many complex factors, and general models cannot automatically learn those features. It is well recognized that wind speed varies in different patterns. In this paper, we propose a deep feature learning (DFL) approach to wind speed forecasting because of its advantages in both multi-layer feature extraction and unsupervised learning. A deep belief network (DBN) model for regression, with an architecture of 144 input and 144 output nodes, was constructed using restricted Boltzmann machines (RBMs). Day-ahead prediction experiments were then carried out. Comparing the experimental results, the prediction errors, in terms of both magnitude and stability, of a DBN model with only three hidden layers were smaller than those of three other typical approaches: support vector regression (SVR), single-hidden-layer neural networks (SHL-NN), and neural networks with three hidden layers (THL-NN). In addition, the DBN model can learn and capture complex features of wind speed through its strong nonlinear mapping ability, which effectively improves its prediction precision. Prediction errors are minimized when the number of the DBN model's hidden layers reaches a threshold value; above this number, further increasing the number of hidden layers does not improve prediction accuracy. Thus, the DBN method has high practical value for wind speed prediction.
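A DBN is typically built by stacking RBMs trained layer by layer. As a minimal sketch of that building block, the following pure-Python binary RBM is trained with one-step contrastive divergence (CD-1) on toy data; the layer sizes, learning rate, and data are illustrative assumptions, not the paper's 144-node configuration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class RBM:
    """Minimal binary RBM trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.vb = [0.0] * n_visible
        self.hb = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.hb[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.hb))]

    def visible_probs(self, h):
        return [sigmoid(self.vb[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.vb))]

    def train(self, data, epochs=50, lr=0.1, seed=1):
        rng = random.Random(seed)
        errors = []
        for _ in range(epochs):
            err = 0.0
            for v0 in data:
                h0 = self.hidden_probs(v0)
                h_sample = [1.0 if rng.random() < p else 0.0 for p in h0]
                v1 = self.visible_probs(h_sample)   # one-step reconstruction
                h1 = self.hidden_probs(v1)
                for i in range(len(v0)):            # CD-1 weight update
                    for j in range(len(h0)):
                        self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                self.vb = [b + lr * (a - c) for b, a, c in zip(self.vb, v0, v1)]
                self.hb = [b + lr * (a - c) for b, a, c in zip(self.hb, h0, h1)]
                err += sum(abs(a - c) for a, c in zip(v0, v1)) / len(v0)
            errors.append(err / len(data))
        return errors

data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 4
rbm = RBM(n_visible=6, n_hidden=3)
errors = rbm.train(data)
print(round(errors[-1], 3))
```

In a DBN, the hidden activations of one trained RBM become the visible input of the next, and the stack is finally fine-tuned for the regression target.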


Author(s):  
Xing Hu ◽  
Ge Li ◽  
Xin Xia ◽  
David Lo ◽  
Shuai Lu ◽  
...  

Code summarization, which aims to generate a succinct natural language description of source code, is extremely useful for code search and code comprehension, and it has played an important role in software maintenance and evolution. Previous approaches generate summaries by retrieving them from similar code snippets. However, these approaches heavily depend on whether similar code snippets can be retrieved and on how similar the snippets are, and they fail to capture the API knowledge in the source code, which carries vital information about its functionality. In this paper, we propose a novel approach, named TL-CodeSum, which transfers API knowledge learned in a different but related task to code summarization. Experiments on large-scale real-world industry Java projects indicate that our approach is effective and outperforms the state of the art in code summarization.
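The retrieval-based baseline criticized above can be sketched in a few lines: return the stored summary of the most token-similar snippet. This illustrates the baseline only, not TL-CodeSum itself; the toy corpus and the Jaccard similarity measure are hypothetical choices.

```python
import re

def tokens(code):
    """Split code into a set of identifier/keyword tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_summary(corpus, query_code):
    """Return the summary of the most token-similar snippet in the corpus."""
    query = tokens(query_code)
    best = max(corpus, key=lambda pair: jaccard(tokens(pair[0]), query))
    return best[1]

corpus = [
    ("public int max(int a, int b) { return a > b ? a : b; }",
     "returns the larger of two integers"),
    ("public void sort(int[] xs) { Arrays.sort(xs); }",
     "sorts an integer array in place"),
]
print(retrieve_summary(corpus, "public int min(int a, int b) { return a < b ? a : b; }"))
# → returns the larger of two integers
```

The deliberately wrong answer for the `min` query shows the abstract's point: when no truly similar snippet exists, retrieval copies a misleading summary, which is the failure mode a generative, API-aware model aims to avoid.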


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 859
Author(s):  
Krishna Choudhary ◽  
Alexander R. Pico

Rapid technological advances in the past decades have enabled molecular biologists to generate large-scale, complex data with affordable resource investments, or to obtain such data from public repositories. Yet many graduate students, postdoctoral scholars, and senior researchers in the biosciences find themselves ill-equipped to analyze large-scale data. Global surveys have revealed that active researchers prefer short training workshops to fill their skill gaps. In this article, we focus on the challenge of delivering a short data-analysis workshop to absolute beginners in computer programming. We propose that introducing R, or other programming languages for data analysis, as a smart version of a calculator can help lower the communication barrier with absolute beginners. We describe this comparison with a few analogies and hope that other instructors will find them useful. We used these analogies in our four-hour training workshops involving participatory live coding, which we delivered in person and via videoconferencing. Anecdotal evidence suggests that our exposition made R programming seem easy and enabled beginners to explore it on their own.


Author(s):  
Thijs Smit ◽  
Niels Aage ◽  
Stephen J. Ferguson ◽  
Benedikt Helgason

Abstract This paper presents a Python wrapper and extended functionality of the parallel topology optimization framework introduced by Aage et al. (Topology optimization using PETSc: an easy-to-use, fully parallel, open source topology optimization framework. Struct Multidiscip Optim 51(3):565–572, 2015). The Python interface, which simplifies the problem definition, is intended to expand the potential user base and to ease the use of large-scale topology optimization for educational purposes. The functionality of the topology optimization framework is extended to include passive domains and local volume constraints, among others, which contributes to its usability in real-world design applications. The functionality is demonstrated via the cantilever beam, bracket, and torsion ball examples. Several tests are provided which can be used to verify proper installation and to evaluate the performance of the user's system setup. The open-source code is available at https://github.com/thsmit/, repository TopOpt_in_PETSc_wrapped_in_Python.


Diabetes ◽  
2020 ◽  
Vol 69 (Supplement 1) ◽  
pp. 1588-P ◽  
Author(s):  
ROMIK GHOSH ◽  
ASHOK K. DAS ◽  
AMBRISH MITHAL ◽  
SHASHANK JOSHI ◽  
K.M. PRASANNA KUMAR ◽  
...  
