scholarly journals Explainable source code authorship attribution algorithm

2021 ◽  
Vol 2134 (1) ◽  
pp. 012011
Author(s):  
Alina Bogdanova ◽  
Vitaly Romanov

Abstract Source Code Authorship Attribution is a problem that is lately studied more often due improvements in Deep Learning techniques. Among existing solutions, two common issues are inability to add new authors without retraining and lack of interpretability. We address both these problem. In our experiments, we were able to correctly classify 75% of authors for diferent programming languages. Additionally, we applied techniques of explainable AI (XAI) and found that our model seems to pay attention to distinctive features of source code.

Electronics ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1421
Author(s):  
Haechan Park ◽  
Nakhoon Baek

With the growth of artificial intelligence and deep learning technology, we have many active research works to apply the related techniques in various fields. To test and apply the latest machine learning techniques in gaming, it will be very useful to have a light-weight game engine for quick prototyping. Our game engine is implemented in a cost-effective way, in comparison to well-known commercial proprietary game engines, by utilizing open source products. Due to its simple internal architecture, our game engine is especially beneficial for modifying and reviewing the new functions through quick and repetitive tests. In addition, the game engine has a DNN (deep neural network) module, with which the proposed game engine can apply deep learning techniques to the game features, through applying deep learning algorithms in real-time. Our DNN module uses a simple C++ function interface, rather than additional programming languages and/or scripts. This simplicity enables us to apply machine learning techniques more efficiently and casually to the game applications. We also found some technical issues during our development with open sources. These issues mostly occurred while integrating various open source products into a single game engine. We present details of these technical issues and our solutions.


2021 ◽  
Vol 24 (4) ◽  
pp. 1-35
Author(s):  
Mohammed Abuhamad ◽  
Tamer Abuhmed ◽  
David Mohaisen ◽  
Daehun Nyang

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires an efficient extraction of authorship attributes. The extraction of such attributes is very challenging, due to various software code formats from executable binaries with different toolchain provenance to source code with different programming languages. Moreover, the quality of attributes is bounded by the availability of software samples to a certain number of samples per author and a specific size for software samples. To this end, this work proposes a deep Learning-based approach for software authorship attribution, that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. This proposed approach incorporates the process of learning deep authorship attribution using a recurrent neural network, and ensemble random forest classifier for scalability to de-anonymize programmers. Comprehensive experiments are conducted to evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results of our work show high accuracy despite requiring a smaller number of samples per author. Experimenting with source-code, our approach allows us to identify 8,903 GCJ authors, the largest-scale dataset used by far, with an accuracy of 92.3%. Using the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language-specifics, and thus it can identify authors of four programming languages (e.g., C, C++, Java, and Python), and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors. Experimenting with executable binaries, our approach achieves 95.74% for identifying 1,500 programmers of software binaries. Similar results were obtained when software binaries are generated with different compilation options, optimization levels, and removing of symbol information. Moreover, our approach achieves 93.86% for identifying 1,500 programmers of obfuscated binaries using all features adopted in Obfuscator-LLVM tool.


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 141987-141999 ◽  
Author(s):  
Farhan Ullah ◽  
Junfeng Wang ◽  
Sohail Jabbar ◽  
Fadi Al-Turjman ◽  
Mamoun Alazab

Author(s):  
Shuai Zhang ◽  
Yi Tay ◽  
Lina Yao ◽  
Bin Wu ◽  
Aixin Sun

Deep learning based recommender systems have been extensively explored in recent years. However, the large number of models proposed each year poses a big challenge for both researchers and practitioners in reproducing the results for further comparisons. Although a portion of papers provides source code, they adopted different programming languages or different deep learning packages, which also raises the bar in grasping the ideas. To alleviate this problem, we released the open source project: \textbf{DeepRec}. In this toolkit, we have implemented a number of deep learning based recommendation algorithms using Python and the widely used deep learning package - Tensorflow. Three major recommendation scenarios: rating prediction, top-N recommendation (item ranking) and sequential recommendation, were considered. Meanwhile, DeepRec maintains good modularity and extensibility to easily incorporate new models into the framework. It is distributed under the terms of the GNU General Public License. The source code is available at github: https://github.com/cheungdaven/DeepRec


2020 ◽  
Author(s):  
Cut Nabilah Damni

AbstrakSoftware komputer atau perangkat lunak komputer merupakan kumpulan instruksi (program atau prosedur) untuk dapat melaksanakan pekerjaan secara otomatis dengan cara mengolah atau memproses kumpulan intruksi (data) yang diberikan. (Yahfizham, 2019 : 19) Sebagian besar dari software komputer dibuat oleh (programmer) dengan menggunakan bahasa pemprograman. Orang yang membuat bahasa pemprograman menuliskan perintah dalam bahasa pemprograman seperti layaknya bahasa yang digunakan oleh orang pada umumnya dalam melakukan perbincangan. Perintah-perintah tersebut dinamakan (source code). Program komputer lainnya dinamakan (compiler) yang digunakan pada (source code) dan kemudian mengubah perintah tersebut kedalam bahasa yang dimengerti oleh komputer lalu hasilnya dinamakan program executable (EXE). Pada dasarnya, komputer selalu memiliki perangkat lunak komputer atau software yang terdiri dari sistem operasi, sistem aplikasi dan bahasa pemograman.AbstractComputer software or computer software is a collection of instructions (programs or procedures) to be able to carry out work automatically by processing or processing the collection of instructions (data) provided. (Yahfizham, 2019: 19) Most of the computer software is made by (programmers) using the programming language. People who make programming languages write commands in the programming language like the language used by people in general in conducting conversation. The commands are called (source code). Other computer programs called (compilers) are used in (source code) and then change the command into a language understood by the computer and the results are called executable programs (EXE). Basically, computers always have computer software or software consisting of operating systems, application systems and programming languages.


Sign in / Sign up

Export Citation Format

Share Document