Explainable source code authorship attribution algorithm

Abstract: Source code authorship attribution has been studied more often lately due to improvements in deep learning techniques. Existing solutions share two common issues: they cannot add new authors without retraining, and they lack interpretability. We address both problems. In our experiments, we correctly classified 75% of authors across different programming languages. Additionally, we applied explainable AI (XAI) techniques and found that our model seems to pay attention to distinctive features of source code.
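
The abstract does not describe how new authors are added without retraining. As a rough illustration of that property only, the sketch below keeps one character n-gram profile per author and attributes a sample to the most similar profile. The class name `ProfileAttributor`, the trigram features, and the cosine measure are all assumptions made for this example; this is not the paper's model.

```python
from collections import Counter
from math import sqrt


def char_ngrams(code, n=3):
    """Count overlapping character n-grams in a source string."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProfileAttributor:
    """Hypothetical nearest-profile attributor (illustrative, not the paper's method)."""

    def __init__(self):
        self.profiles = {}  # author name -> aggregated n-gram counts

    def add_author(self, name, samples):
        # Enrolling an author only stores a new profile;
        # existing profiles are untouched, so nothing is retrained.
        profile = Counter()
        for sample in samples:
            profile.update(char_ngrams(sample))
        self.profiles[name] = profile

    def predict(self, code):
        # Attribute the sample to the author with the most similar profile.
        query = char_ngrams(code)
        return max(self.profiles, key=lambda name: cosine(query, self.profiles[name]))
```

Because each author is an independent entry in `profiles`, calling `add_author` for a newcomer leaves earlier authors' profiles unchanged. A learned embedding could replace the n-gram counts while keeping the same enrollment pattern.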

Developing an Open-Source Lightweight Game Engine with DNN Support

Electronics ◽

10.3390/electronics9091421 ◽

2020 ◽

Vol 9 (9) ◽

pp. 1421

Author(s):

Haechan Park ◽

Nakhoon Baek

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Open Source ◽

Programming Languages ◽

Cost Effective ◽

Machine Learning Techniques ◽

Learning Technology ◽

Game Engine ◽

Learning Techniques ◽

Technical Issues

With the growth of artificial intelligence and deep learning technology, much active research applies the related techniques in various fields. To test and apply the latest machine learning techniques in gaming, a lightweight game engine for quick prototyping is very useful. Our game engine is implemented cost-effectively, in comparison to well-known commercial proprietary game engines, by using open-source products. Thanks to its simple internal architecture, our game engine is especially well suited to modifying and reviewing new functions through quick, repeated tests. In addition, the game engine has a DNN (deep neural network) module, with which it can apply deep learning algorithms to game features in real time. Our DNN module uses a simple C++ function interface rather than additional programming languages and/or scripts. This simplicity lets us apply machine learning techniques to game applications more efficiently and casually. We also encountered some technical issues while developing with open-source software, mostly while integrating various open-source products into a single game engine. We present details of these technical issues and our solutions.

Large-scale and Robust Code Authorship Identification with Deep Feature Learning

ACM Transactions on Privacy and Security ◽

10.1145/3461666 ◽

2021 ◽

Vol 24 (4) ◽

pp. 1-35

Author(s):

Mohammed Abuhamad ◽

Tamer Abuhmed ◽

David Mohaisen ◽

Daehun Nyang

Keyword(s):

Programming Languages ◽

Real World ◽

Large Scale ◽

Source Code ◽

Feature Learning ◽

Identification Accuracy ◽

Authorship Attribution ◽

Deep Feature ◽

Public Repositories ◽

Authorship Identification

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires an efficient extraction of authorship attributes. The extraction of such attributes is very challenging because software comes in various formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of attributes is bounded by the availability of software samples: a certain number of samples per author and a specific size for each sample. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach learns deep authorship attributes using a recurrent neural network and uses an ensemble random forest classifier for scalability to de-anonymize programmers. Comprehensive experiments are conducted to evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results of our work show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach allows us to identify 8,903 GCJ authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Using the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics, and thus it can identify authors of four programming languages (C, C++, Java, and Python) and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress), with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves 95.74% accuracy for identifying 1,500 programmers of software binaries. Similar results were obtained when software binaries are generated with different compilation options and optimization levels, and with symbol information removed. Moreover, our approach achieves 93.86% accuracy for identifying 1,500 programmers of binaries obfuscated using all features adopted in the Obfuscator-LLVM tool.

Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

IEEE Access ◽

10.1109/access.2019.2943639 ◽

2019 ◽

Vol 7 ◽

pp. 141987-141999 ◽

Cited By ~ 4

Author(s):

Farhan Ullah ◽

Junfeng Wang ◽

Sohail Jabbar ◽

Fadi Al-Turjman ◽

Mamoun Alazab

Keyword(s):

Deep Learning ◽

Source Code ◽

Hybrid Approach ◽

Learning Model ◽

Dependence Graph ◽

Authorship Attribution ◽

Program Dependence Graph ◽

Deep Learning Model ◽

Program Dependence

DeepRec: An Open-source Toolkit for Deep Learning based Recommendation

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/963 ◽

2019 ◽

Author(s):

Shuai Zhang ◽

Yi Tay ◽

Lina Yao ◽

Bin Wu ◽

Aixin Sun

Keyword(s):

Deep Learning ◽

Open Source ◽

Programming Languages ◽

Recommender Systems ◽

General Public ◽

Source Code ◽

Recommendation Algorithms ◽

New Models ◽

Rating Prediction ◽

General Public License

Deep learning based recommender systems have been extensively explored in recent years. However, the large number of models proposed each year poses a big challenge for both researchers and practitioners in reproducing the results for further comparisons. Although some papers provide source code, they adopt different programming languages or different deep learning packages, which also raises the bar for grasping the ideas. To alleviate this problem, we released the open source project DeepRec. In this toolkit, we have implemented a number of deep learning based recommendation algorithms using Python and the widely used deep learning package TensorFlow. Three major recommendation scenarios were considered: rating prediction, top-N recommendation (item ranking), and sequential recommendation. Meanwhile, DeepRec maintains good modularity and extensibility so that new models can be easily incorporated into the framework. It is distributed under the terms of the GNU General Public License. The source code is available on GitHub: https://github.com/cheungdaven/DeepRec
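
To make the difference between two of the scenarios named above concrete, the following minimal sketch scores a user-item pair (rating prediction) and then ranks unseen items by that score (top-N recommendation). It does not use DeepRec's API; the function names `predict_rating` and `top_n`, the dot-product scoring, and the latent factor values are all assumptions chosen for illustration.

```python
def predict_rating(user_vec, item_vec):
    """Rating prediction: score one user-item pair as a dot product of latent factors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def top_n(user_vec, item_vecs, seen, n):
    """Top-N recommendation: rank unseen items by predicted score, keep the n best."""
    scores = {
        item: predict_rating(user_vec, vec)
        for item, vec in item_vecs.items()
        if item not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical latent factors, made up for illustration only.
user = [1.0, 0.5]
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 2.0], "d": [0.5, 0.5]}

# Item "c" is treated as already seen, so it is excluded from the ranking.
recommended = top_n(user, items, seen={"c"}, n=2)
```

Rating prediction is evaluated on how close each score is to the true rating, whereas top-N recommendation only depends on the order of the scores, which is why the two are usually treated as separate scenarios.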

Anomaly Detection and Categorization in Cloud Environment using Deep Learning Techniques

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i5.211214 ◽

2019 ◽

Vol 7 (5) ◽

pp. 211-214

Author(s):

Nidhi Thakkar ◽

Miren Karamta ◽

Seema Joshi ◽

M. B. Potdar

Keyword(s):

Deep Learning ◽

Anomaly Detection ◽

Cloud Environment ◽

Learning Techniques

Deep Learning Techniques for Naskh and Nastalique Writing Style Text Recognition

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i4.7076 ◽

2019 ◽

Vol 7 (4) ◽

pp. 70-76

Author(s):

Shanky Goel ◽

Gurpreet Singh Lehal

Keyword(s):

Deep Learning ◽

Text Recognition ◽

Writing Style ◽

Learning Techniques

A Study on the Types of Classic Fiction Using Deep Learning Techniques - Focusing on hero novels and romantic novels -

Korean Language and Literature in International Context ◽

10.31147/iall.84.1 ◽

2020 ◽

Vol 84 ◽

pp. 9-35

Author(s):

Woo-kyu Kang ◽

Ba-ro Kim

Keyword(s):

Deep Learning ◽

Learning Techniques

PERANGKAT LUNAK KOMPUTER (Computer Software)

10.31219/osf.io/tjbfr ◽

2020 ◽

Author(s):

Cut Nabilah Damni

Keyword(s):

Programming Languages ◽

Programming Language ◽

Operating Systems ◽

Source Code ◽

Computer Software ◽

Computer Programs ◽

Application Systems ◽

Executable Programs

Abstract: Computer software is a collection of instructions (programs or procedures) for carrying out work automatically by processing the collection of instructions (data) provided (Yahfizham, 2019: 19). Most computer software is written by programmers using a programming language. Programmers write commands in the programming language much as people use ordinary language in conversation. These commands are called source code. Another computer program, called a compiler, is applied to the source code and translates the commands into a language the computer understands; the result is called an executable program (EXE). Basically, a computer always has software consisting of operating systems, application systems, and programming languages.
