Uncovering source code reuse in large-scale academic environments

Enrique Flores; Alberto Barrón-Cedeño; Lidia Moreno; Paolo Rosso

doi:10.1002/cae.21608

Semi-automating small-scale source code reuse via structural correspondence

Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering - SIGSOFT '08/FSE-16 ◽

10.1145/1453101.1453130 ◽

2008 ◽

Cited By ~ 19

Author(s):

Rylan Cottrell ◽

Robert J. Walker ◽

Jörg Denzinger

Keyword(s):

Source Code ◽

Small Scale ◽

Code Reuse ◽

Structural Correspondence

Automatic detection of Long Method and God Class code smells through neural source code embeddings

10.36227/techrxiv.17206010.v1 ◽

2021 ◽

Author(s):

Aleksandar Kovačević ◽

Jelena Slivka ◽

Dragan Vidaković ◽

Katarina-Glorija Grujić ◽

Nikola Luburić ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Negative Impact ◽

Source Code ◽

Systematic Evaluation ◽

Small Scale ◽

Code Smells ◽

Code Metrics ◽

Code Smell ◽

F Measure

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging, and researchers have proposed many automatic code smell detectors. Most studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluating the detectors on small-scale case studies and using inconsistent experimental settings. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection. This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for the detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT). We perform our experiments on the large-scale, manually labeled MLCQ dataset. We treat smell detection as a binary classification problem: we classify the code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as the measure of performance. In our experiments, the ML classifier trained using CuBERT source code embeddings achieved the best performance for both God Class detection (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform an error analysis to discuss the advantages of the CuBERT approach. To the best of our knowledge, this study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection. A secondary contribution of our study is the systematic evaluation of the effectiveness of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.
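The performance measure named in the abstract above, the F1-measure of the minority (smell) class, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `f1_for_class` and the toy labels are assumptions made only for illustration, with 1 standing for "smelly" and 0 for "non-smelly".

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1-measure computed for a single class (here, the minority 'smelly' class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive prediction: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 3 smelly samples out of 10 (the minority class).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

score = f1_for_class(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f"minority-class F1 = {score:.3f}, accuracy = {accuracy:.3f}")
```

On this toy data the accuracy is higher than the minority-class F1, which suggests why a per-class F1 is a more demanding measure than accuracy when smelly samples are rare.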

Pragmatic source code reuse via execution record and replay

Journal of Software Evolution and Process ◽

10.1002/smr.1790 ◽

2016 ◽

Vol 28 (8) ◽

pp. 642-664 ◽

Cited By ~ 4

Author(s):

Ameer Armaly ◽

Collin McMillan

Keyword(s):

Source Code ◽

Code Reuse ◽

Record And Replay

Results from a large-scale study of performance optimization techniques for source code analyses based on graph reachability algorithms

Proceedings Third IEEE International Workshop on Source Code Analysis and Manipulation ◽

10.1109/scam.2003.1238046 ◽

2004 ◽

Cited By ~ 9

Author(s):

D. Binkley ◽

M. Harman

Keyword(s):

Performance Optimization ◽

Large Scale ◽

Source Code ◽

Optimization Techniques ◽

Large Scale Study ◽

Graph Reachability

Detection of Redundant Condition Expression for Large Scale Source Code

Communications in Computer and Information Science - Geo-Spatial Knowledge and Intelligence ◽

10.1007/978-981-13-0893-2_33 ◽

2018 ◽

pp. 312-317

Author(s):

Dandan Gong ◽

Wensheng Xu ◽

Chunfang Qiu ◽

Libei Zhou

Keyword(s):

Large Scale ◽

Source Code

Logging Analysis and Prediction in Open Source Java Project

Research Anthology on Usage and Development of Open Source Software ◽

10.4018/978-1-7998-9158-1.ch038 ◽

2021 ◽

pp. 733-761

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Content Analysis ◽

Software Development ◽

Anomaly Detection ◽

Open Source ◽

Large Scale ◽

Source Code ◽

Scale Analysis ◽

Large Scale Analysis ◽

Research Questions

Log statements in source code provide important information to software developers because they are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and the catch-block level. They answer several research questions related to statistical and content analysis. The statistical and content analysis reveals differentiating properties between logged and nonlogged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-block logging prediction, which is found to be effective.

OpenBioLink: a benchmarking framework for large-scale biomedical link prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa274 ◽

2020 ◽

Vol 36 (13) ◽

pp. 4097-4098 ◽

Cited By ~ 3

Author(s):

Anna Breit ◽

Simon Ott ◽

Asan Agibetov ◽

Matthias Samwald

Keyword(s):

Link Prediction ◽

Large Scale ◽

Source Code ◽

Machine Learning Algorithms ◽

Knowledge Networks ◽

Supplementary Information ◽

Supplementary Data ◽

Biomedical Knowledge ◽

High Quality ◽

Baseline Evaluation

Summary: Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results.

Availability and implementation: Source code and data are openly available at https://github.com/OpenBioLink/OpenBioLink.

Supplementary information: Supplementary data are available at Bioinformatics online.

Source Code Generation for large scale applications

2013 The International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE) ◽

10.1109/taeece.2013.6557309 ◽

2013 ◽

Author(s):

Havva Cetiner Altiparmak ◽

Busra Tokgoz ◽

Okkes Emin Balcicek ◽

Aslihan Ozkaya ◽

Ahmet Arslan

Keyword(s):

Code Generation ◽

Large Scale ◽

Source Code

A Case Study of Refactoring Large-Scale Industrial Systems to Efficiently Improve Source Code Quality

Computational Science and Its Applications – ICCSA 2014 - Lecture Notes in Computer Science ◽

10.1007/978-3-319-09156-3_37 ◽

2014 ◽

pp. 524-540 ◽

Cited By ~ 6

Author(s):

Gábor Szőke ◽

Csaba Nagy ◽

Rudolf Ferenc ◽

Tibor Gyimóthy

Keyword(s):

Large Scale ◽

Source Code ◽

Industrial Systems ◽

Code Quality

epiTAD: a web application for visualizing chromosome conformation capture data in the context of genetic epidemiology

Bioinformatics ◽

10.1093/bioinformatics/btz387 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4462-4464

Author(s):

Jordan H Creed ◽

Garrick Aden-Buie ◽

Alvaro N Monteiro ◽

Travis A Gerke

Keyword(s):

Genetic Epidemiology ◽

In Silico ◽

Web Application ◽

Large Scale ◽

Source Code ◽

Chromosome Conformation Capture ◽

Chromosome Conformation ◽

Web Based ◽

Genome Wide ◽

Public Data

Summary: Complementary advances in genomic technology and public data resources have created opportunities for researchers to conduct multifaceted examination of the genome on a large scale. To meet the need for integrative genome-wide exploration, we present epiTAD. This web-based tool enables researchers to compare genomic 3D organization and annotations across multiple databases in an interactive manner to facilitate in silico discovery.

Availability and implementation: epiTAD can be accessed at https://apps.gerkelab.com/epiTAD/ where we have additionally made publicly available the source code and a Docker containerized version of the application.