On the Reproducibility and Replicability of Deep Learning in Software Engineering

2022 ◽  
Vol 31 (1) ◽  
pp. 1-46
Author(s):  
Chao Liu ◽  
Cuiyun Gao ◽  
Xin Xia ◽  
David Lo ◽  
John Grundy ◽  
...  

Context: Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years, because they can often solve many SE challenges without enormous manual feature engineering effort or complex domain knowledge.

Objective: Although many DL studies have reported substantial advantages over other state-of-the-art models in effectiveness, they often ignore two factors: (1) reproducibility, i.e., whether the reported experimental results can be obtained by other researchers using the authors’ artifacts (i.e., source code and datasets) with the same experimental setup; and (2) replicability, i.e., whether the reported experimental results can be obtained by other researchers using their own re-implemented artifacts with a different experimental setup. We observed that DL studies commonly overlook these two factors and declare them as minor threats or leave them for future work. This is mainly due to high model complexity, with many manually set parameters and a time-consuming optimization process, unlike classical supervised machine learning (ML) methods (e.g., random forest). This study aims to investigate the urgency and importance of reproducibility and replicability for DL studies on SE tasks.

Method: We conducted a literature review of 147 DL studies recently published in 20 SE venues and 20 AI (Artificial Intelligence) venues to investigate these issues. We also re-ran four representative DL models in SE to investigate important factors that may strongly affect the reproducibility and replicability of a study.

Results: Our statistics show the urgency of investigating these two factors in SE: only 10.2% of the studies investigate any research question showing that their models can address at least one issue of replicability and/or reproducibility, and more than 62.6% of the studies do not even share high-quality source code or complete data to support the reproducibility of their complex models. Meanwhile, our experimental results show the importance of reproducibility and replicability: the reported performance of a DL model could not be reproduced due to an unstable optimization process, and replicability could be substantially compromised if model training does not converge or if performance is sensitive to the size of the vocabulary and the testing data.

Conclusion: It is urgent for the SE community to provide long-lasting links to high-quality reproduction packages, enhance the stability and convergence of DL-based solutions, and avoid performance sensitivity to differently sampled data.
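
The instability noted in the Results often traces back to uncontrolled randomness in training. As a minimal sketch of one common reproducibility measure (assuming PyTorch; this is illustrative, not the paper's own artifact), the usual random seeds can be pinned before training:

```python
# Minimal sketch of controlling randomness for reproducibility checks.
# Assumes PyTorch; the surrounding training code is left out.
import random

import numpy as np
import torch


def set_reproducible_seeds(seed: int = 42) -> None:
    """Pin the common sources of nondeterminism before training."""
    random.seed(seed)                  # Python's own RNG
    np.random.seed(seed)               # NumPy (shuffling, sampling)
    torch.manual_seed(seed)            # CPU and CUDA initialisation
    torch.cuda.manual_seed_all(seed)   # all visible GPUs
    # Prefer deterministic kernels where available (may slow training).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_reproducible_seeds(42)
```

Deterministic cuDNN kernels can slow training noticeably, which is one reason such settings are often left off and reproducibility suffers.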

2020 ◽  
Author(s):  
Ching Tarn ◽  
Wen-Feng Zeng ◽  
Zhengcong Fei ◽  
Si-Min He

Spectrum prediction using deep learning has attracted a lot of attention in recent years. Although existing deep learning methods have dramatically increased prediction accuracy, there is still considerable room for improvement, which is currently limited by differences in fragmentation types and instrument settings. In this work, we use few-shot learning to fit the data online and make up for this shortcoming. The method is evaluated using ten datasets, where the instruments include Velos, QE, Lumos, and Sciex, with collision energies set differently. Experimental results show that few-shot learning can achieve higher prediction accuracy with almost negligible computing resources. For example, on the dataset from an untrained instrument, Sciex-6600, the prediction accuracy is increased from 69.7% to 86.4% within about 10 seconds; on the CID (collision-induced dissociation) dataset, the prediction accuracy of the model trained on HCD (higher-energy collisional dissociation) spectra is increased from 48.0% to 83.9%. It is also shown that the method is not sensitive to data quality and is sufficiently efficient to fill the accuracy gap. The source code of pDeep3 is available at http://pfind.ict.ac.cn/software/pdeep3.
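
The core idea, fitting the model online to a few labelled spectra, can be sketched as a brief fine-tuning loop. The sketch below assumes PyTorch; the model, the tensors, and the MSE loss are hypothetical stand-ins rather than pDeep3's actual API:

```python
# Hedged sketch of few-shot online adaptation: briefly fine-tune a
# pretrained spectrum predictor on a handful of spectra from the new
# instrument. All names here are illustrative, not pDeep3's API.
import torch
from torch import nn


def few_shot_adapt(model: nn.Module,
                   support_x: torch.Tensor,
                   support_y: torch.Tensor,
                   steps: int = 100,
                   lr: float = 1e-4) -> nn.Module:
    """Fit the model online to a few labelled spectra."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # intensity regression as a stand-in loss
    model.train()
    for _ in range(steps):
        optimiser.zero_grad()
        loss = loss_fn(model(support_x), support_y)
        loss.backward()
        optimiser.step()
    return model


# Example: adapt a toy linear "predictor" on 8 support spectra.
model = few_shot_adapt(nn.Linear(16, 4),
                       torch.randn(8, 16), torch.randn(8, 4))
```

A hundred gradient steps on a handful of spectra is consistent with the roughly 10-second adaptation reported above, though the paper's exact procedure may differ.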


2020 ◽  
Vol 38 (5/6) ◽  
pp. 1013-1033
Author(s):  
Hei Chia Wang ◽  
Yu Hung Chiang ◽  
Si Ting Lin

Purpose: In community question and answer (CQA) services, because of user subjectivity and the limits of knowledge, the quality of answers can vary drastically, from highly relevant to irrelevant or even spam. Previous studies of CQA portals have faced two important issues: answer quality analysis and spam answer filtering. Therefore, the purposes of this study are to filter spam answers in advance using a two-phase identification method, to automatically classify the different types of question and answer (QA) pairs by deep learning, and, finally, to present a comprehensive study of answer quality prediction for different types of QA pairs.

Design/methodology/approach: This study proposes an integrated model with a two-phase identification method that filters spam answers in advance and uses a deep learning method, a recurrent convolutional neural network (R-CNN), to automatically classify various types of questions. Logistic regression (LR) is further applied to examine which answer quality features significantly indicate high-quality answers to different types of questions.

Findings: There are four prominent findings. (1) This study confirms that conducting spam filtering before answer quality analysis reduces the proportion of high-quality answers misjudged as spam. (2) The experimental results show that answer quality prediction is better when question types are included. (3) The analysis of different classifiers shows that the R-CNN achieves the best macro-F1 score (74.8%) in the question type classification module. (4) Finally, the LR results show that author ranking, answer length, and common words significantly impact answer quality for different types of questions.

Originality/value: The proposed system simultaneously detects spam answers and provides users with quick and efficient retrieval of high-quality answers to different types of questions in CQA. Moreover, this study validates that crucial features exist among the different types of questions that impact answer quality. Overall, the identification system automatically surfaces high-quality answers for each type of question from the pool of messy answers in CQA, which can be very useful in helping users make decisions.
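
As a rough illustration of the LR step (not the authors' code), one can fit a logistic regression on candidate answer-quality features and inspect the coefficients; the feature names below echo the findings, but the data are random placeholders:

```python
# Illustrative sketch of examining which answer-quality features
# indicate high-quality answers. Features and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["author_ranking", "answer_length", "common_words"]
X = np.random.rand(200, 3)            # placeholder feature matrix
y = np.random.randint(0, 2, 200)      # 1 = high-quality answer

clf = LogisticRegression().fit(X, y)
for name, coef in zip(feature_names, clf.coef_[0]):
    print(f"{name}: {coef:+.3f}")     # sign/size hints at influence
```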


2020 ◽  
Vol 2020 ◽  
pp. 1-18
Author(s):  
Md. Mostafizer Rahman ◽  
Yutaka Watanobe ◽  
Keita Nakamura

In recent years, millions of source code files have been generated in different languages every day, all over the world. A deep neural network-based intelligent support model for source code completion would be a great advantage in the software engineering and programming education fields. Vast numbers of syntax, logical, and other critical errors that cannot be detected by normal compilers continue to exist in source code, and the development of an intelligent evaluation methodology that does not rely on manual compilation has become essential. Even experienced programmers often find it necessary to analyze an entire program to find a single error, and are thus forced to waste valuable time debugging their source code. With this in mind, we propose an intelligent model based on long short-term memory (LSTM) combined with an attention mechanism for source code completion. The proposed model can detect source code errors, locate them, and then predict the correct words. In addition, the proposed model can classify source code as erroneous or not. We trained the proposed model on source code and then evaluated its performance. All of the data used in our experiments were extracted from the Aizu Online Judge (AOJ) system. The experimental results show that the accuracy of our proposed model is approximately 62% for error detection and prediction and approximately 96% for source code classification, outperforming a standard LSTM and other state-of-the-art models. Moreover, in comparison to state-of-the-art models, our proposed model achieved a notable level of success in terms of error detection, prediction, and classification when applied to long source code sequences. Overall, these experimental results indicate the usefulness of our proposed model in the software engineering and programming education fields.
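
A minimal sketch, assuming PyTorch, of an LSTM combined with a simple additive attention layer for next-token prediction over code tokens; the layer sizes and attention variant are illustrative, not the paper's exact architecture:

```python
# Sketch of an LSTM + attention classifier over token sequences.
# Vocabulary size, dimensions, and attention form are assumptions.
import torch
from torch import nn


class AttentiveLSTM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)   # scores each time step
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(tokens))          # (B, T, H)
        weights = torch.softmax(self.attn(h), dim=1)  # attend over time
        context = (weights * h).sum(dim=1)            # (B, H)
        return self.out(context)                      # next-token logits


model = AttentiveLSTM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 32)))  # batch of 4 sequences
```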


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10813
Author(s):  
Qianfei Huang ◽  
Wenyang Zhou ◽  
Fei Guo ◽  
Lei Xu ◽  
Lichao Zhang

With the accumulation of data on 6mA modification sites, an increasing number of scholars have begun to focus on the identification of 6mA sites. Despite the recognized importance of 6mA sites, methods for their identification remain lacking, with most existing methods targeting individual species. In the present study, we aimed to develop an identification method suitable for multiple species. Building on previous research, we propose 6mA-Pred, a method for 6mA site recognition. Our experiments show that 6mA-Pred is effective for identifying 6mA sites in genes from taxa such as rice, Mus musculus, and human, and a series of experimental results further confirms its effectiveness. We provide the source code used in the study, which can be obtained from http://39.100.246.211:5004/6mA_Pred/.
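
A common preprocessing step for models of this kind is one-hot encoding a fixed-length DNA window around the candidate adenine; the sketch below is a generic illustration, not 6mA-Pred's actual pipeline:

```python
# Generic sketch: one-hot encode a DNA window for a site classifier.
# The window length and downstream model are assumptions.
import numpy as np

BASES = "ACGT"


def one_hot_dna(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len, 4) one-hot matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASES:             # N or other symbols stay all-zero
            mat[i, BASES.index(base)] = 1.0
    return mat


print(one_hot_dna("ACGTA").shape)  # (5, 4)
```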


Author(s):  
A. A. Prihozhy ◽  
A. M. Zhdanouski

Partitioning a set of professional programmers into teams, when a programming project specifies competency requirements for various programming technologies and tools, is a hard combinatorial problem. The paper proposes a genetic algorithm that is capable of finding competitive, high-quality partitioning solutions in acceptable runtime. The algorithm encodes chromosomes in such a way as to assign each programmer to a team, define the team staff, and easily reconstruct the teams during the optimization process. A fitness function characterizes each chromosome with respect to the quality of the programmer partitioning. It accounts for the average qualification of the teams and the qualification of each team's best representatives in each of the technologies. The function recognizes teams that meet all project constraints and are workable from this point of view, as well as teams that do not meet the constraints and are unworkable. The algorithm defines the genetic operations of selection, crossover, and mutation in such a way as to move programmers from unworkable to workable teams, increase the number of workable teams, exchange programmers among workable teams, increase the competency of every workable team, and thus maximize the teams' overall qualification. Experimental results obtained on a set of programmers who graduated from Belarusian universities show the capability of the genetic algorithm to find good partitioning solutions, maximize the teams' competency, and minimize the number of unemployed programmers.
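
The chromosome encoding described above (position i holds the team index of programmer i) lends itself to a compact sketch. The fitness below keeps only an average-qualification term and omits the workability constraints, and the search loop is a mutation-only simplification of the full genetic algorithm:

```python
# Simplified stand-in for the paper's encoding and fitness function.
# Team/programmer counts and skill values are made-up placeholders.
import random

NUM_PROGRAMMERS = 12
NUM_TEAMS = 3


def random_chromosome() -> list[int]:
    """Position i holds the team index assigned to programmer i."""
    return [random.randrange(NUM_TEAMS) for _ in range(NUM_PROGRAMMERS)]


def fitness(chrom: list[int], skill: list[float]) -> float:
    """Mean of team average qualifications; constraints omitted."""
    totals, counts = [0.0] * NUM_TEAMS, [0] * NUM_TEAMS
    for prog, team in enumerate(chrom):
        totals[team] += skill[prog]
        counts[team] += 1
    return sum(t / c for t, c in zip(totals, counts) if c) / NUM_TEAMS


def mutate(chrom: list[int]) -> list[int]:
    """Move one programmer to a (possibly) different team."""
    child = chrom[:]
    child[random.randrange(NUM_PROGRAMMERS)] = random.randrange(NUM_TEAMS)
    return child


skills = [random.random() for _ in range(NUM_PROGRAMMERS)]
best = random_chromosome()
for _ in range(500):                  # mutation-only search, not a full GA
    child = mutate(best)
    if fitness(child, skills) > fitness(best, skills):
        best = child
```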


2021 ◽  
Vol 255 ◽  
pp. 03002
Author(s):  
Christian Schulze ◽  
Sebastian Henkel ◽  
Jens Bliedtner

The experimental setup compares face grinding with lateral grinding in a single processing step, using fused silica and BK7 as materials. Resin- and metal-bonded tools are used in the face and lateral grinding strategies. The paper presents the results in order to deepen knowledge of these grinding technologies and to further improve the properties of fabricated components. Furthermore, the results confirm that ultra-fine grinding is a technology that can be used to process inorganic non-metallic materials into high-quality surfaces with low roughness and high flatness in a short processing time.


Robotics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 131
Author(s):  
Yusuke Takayama ◽  
Photchara Ratsamee ◽  
Tomohiro Mashita

Recently, several deep-learning-based navigation methods have been achieved thanks to high-quality datasets collected from high-quality simulated environments. However, the cost of creating high-quality simulated environments is high. In this paper, we present the concept of reduced simulation, which can serve as a simplified version of a simulated environment yet be efficient enough for training deep-learning-based UAV collision avoidance approaches. Our approach deals with the reality gap between a reduced simulation dataset and a real-world dataset and can provide a clear guideline for reduced simulation design. Our user study confirmed that the reduction in visual features provided by textures and lighting does not affect operating performance. Moreover, by conducting collision detection experiments, we verified that our reduced simulation outperforms conventional cost-effective simulations in adaptation capability with respect to realistic simulations and real-world scenarios.


2020 ◽  
Vol 2020 (4) ◽  
pp. 116-1-116-7
Author(s):  
Raphael Antonius Frick ◽  
Sascha Zmudzinski ◽  
Martin Steinebach

In recent years, the number of forged videos circulating on the Internet has increased immensely, and software and services to create such forgeries have become more and more accessible to the public. In this regard, the risk of malicious use of forged videos has risen. This work proposes an approach, based on the Ghost effect known from image forensics, for detecting forgeries in videos that replace faces in video sequences or change the facial expressions of a face. The experimental results show that the proposed approach is able to identify forgeries in high-quality encoded video content.
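
The Ghost effect referred to here comes from JPEG forensics: re-saving an image at a range of qualities and examining the per-pixel difference reveals a dip (a "ghost") in regions that were previously compressed near one of those qualities. A hedged single-frame sketch using Pillow and NumPy, not the authors' exact pipeline:

```python
# Sketch of the JPEG ghost idea: re-save a frame at several qualities
# and compute per-pixel difference maps. Quality range is an assumption.
import io

import numpy as np
from PIL import Image


def ghost_maps(frame: Image.Image, qualities=range(50, 100, 10)):
    """Return {quality: mean-squared-difference map} for one frame."""
    rgb = frame.convert("RGB")
    original = np.asarray(rgb, dtype=np.float32)
    maps = {}
    for q in qualities:
        buf = io.BytesIO()
        rgb.save(buf, format="JPEG", quality=q)   # re-save at quality q
        buf.seek(0)
        resaved = np.asarray(Image.open(buf).convert("RGB"),
                             dtype=np.float32)
        maps[q] = ((original - resaved) ** 2).mean(axis=2)  # per-pixel
    return maps


# Example (hypothetical file): maps = ghost_maps(Image.open("frame.png"))
```

A pasted region that was previously compressed at quality q shows an unusually low difference near q, which is the cue a ghost-based detector looks for.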


IET Software ◽  
2020 ◽  
Vol 14 (6) ◽  
pp. 654-664
Author(s):  
Abubakar Omari Abdallah Semasaba ◽  
Wei Zheng ◽  
Xiaoxue Wu ◽  
Samuel Akwasi Agyemang
