Automatic detection of Long Method and God Class code smells through neural source code embeddings

2021
Author(s): Aleksandar Kovačević, Jelena Slivka, Dragan Vidaković, Katarina-Glorija Grujić, Nikola Luburić, ...

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging, so researchers have proposed many automatic code smell detectors. Most studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluation on small-scale case studies and inconsistent experimental settings. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection.

This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for detecting the God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT).

We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem: we classify code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as the measure of performance. In our experiments, the ML classifier trained on CuBERT source code embeddings achieved the best performance for both God Class detection (F1-measure of 0.53) and Long Method detection (F1-measure of 0.75). With the help of a domain expert, we perform an error analysis to discuss the advantages of the CuBERT approach.

To the best of our knowledge, this study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection. A secondary contribution of our study is a systematic evaluation of the effectiveness of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.
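A minimal sketch of the evaluation setup this abstract describes, assuming the CuBERT embeddings have already been precomputed into a feature matrix. The paper's own code is not reproduced here, so the classifier choice, embedding dimension, and data below are illustrative placeholders:

```python
# Sketch only: X stands in for precomputed CuBERT embeddings (one row per
# code sample); y marks samples as smelly (1) or non-smelly (0).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1024))          # placeholder embedding matrix
y = (rng.random(1000) < 0.15).astype(int)  # smells are the minority class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

# F1-measure of the minority (smell) class, as used in the paper
print("F1 (smell class):", f1_score(y_test, clf.predict(X_test), pos_label=1))
```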


Software code smells are structural features that reside in software source code. Code smell detection is an established method for discovering problems in source code and reorganizing the inner structure of object-oriented software, improving its quality in terms of maintainability, reusability, and cost minimization. Identifying where a code smell occurs in a system, and rectifying it, is a major challenge for developers. Various code smell detection techniques have been designed, but they fail to classify the smell type or to rectify it at minimum cost. To perform this classification at minimum cost, an efficient technique called the Machine Learning Ada-Boost Classifier (MLABC) is introduced. MLABC improves software quality by identifying and rectifying the different types of code smell in source code. Initially, MLABC uses decision trees as base classifiers to identify the code smell type, classifying each smell according to certain rules. The base classifiers are then combined into a strong classifier using the AdaBoost machine learning technique, and the output of the strong classifier identifies the code smell type. Finally, the identified code smell is rectified by applying refactoring, with minimum cost and space complexity. Experimental results show that the proposed MLABC technique improves software code quality in terms of code smell type identification accuracy, false positive rate, code smell type rectification cost, and space complexity.
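The core classification step, boosting shallow decision trees into a strong classifier, can be sketched as follows. This is a generic AdaBoost illustration, not the MLABC implementation; the metric features and smell-type labels are synthetic stand-ins:

```python
# Sketch only: AdaBoost over shallow decision trees, predicting a code smell
# type from per-method code metrics (synthetic placeholders below).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))    # stand-in for code metrics
y = rng.integers(0, 3, size=500)  # stand-in smell types, e.g. 0 = none,
                                  # 1 = Long Method, 2 = God Class

# Shallow decision trees as weak learners, boosted into a strong classifier
# (the `estimator` argument requires scikit-learn >= 1.2).
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    random_state=0,
)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```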


Author(s): Amandeep Kaur, Sushma Jain, Shivani Goel, Gaurav Dhiman

Context: Code smells are symptoms that something may be wrong in a software system, which can cause complications in maintaining software quality. Many code smells exist in the literature, and their identification is far from trivial; thus, several techniques have been proposed to automate code smell detection in order to improve software quality. Objective: This paper presents an up-to-date review of simple and hybrid machine-learning-based code smell detection techniques and tools. Methods: We collected all the relevant research published in this field until 2020, extracted the data from those articles, and classified them into two major categories. In addition, we compared the selected studies based on several aspects, such as code smells, machine learning techniques, datasets, programming languages used by the datasets, dataset size, evaluation approach, and statistical testing. Results: The majority of empirical studies have proposed machine-learning-based code smell detection tools. Support vector machine and decision tree algorithms are frequently used by researchers. A major proportion of the research is conducted on open-source software (OSS) such as Xerces, GanttProject, and ArgoUML. Furthermore, researchers have paid more attention to the Feature Envy and Long Method code smells. Conclusion: We identified several open research areas, such as the need for code smell detection techniques using hybrid approaches and the need for validation using industrial datasets.


2021, Vol. 11 (2), pp. 472
Author(s): Hyeongmin Cho, Sangkyun Lee

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
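One plausible instantiation of the idea above, not the paper's exact measures: estimate within-class spread (in-class variability) and between-class distance (separability) after a random projection, keeping the computation cheap on high-dimensional data. All dimensions and data here are illustrative:

```python
# Sketch only: random projection to a low dimension, then a simple
# within-class vs. between-class distance comparison.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from scipy.spatial.distance import pdist, cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2048))  # stand-in for high-dimensional features
y = rng.integers(0, 2, size=600)  # two classes

Z = GaussianRandomProjection(n_components=64, random_state=0).fit_transform(X)

within = np.mean([pdist(Z[y == c]).mean() for c in np.unique(y)])
between = cdist(Z[y == 0], Z[y == 1]).mean()

print("in-class variability (mean within-class distance):", within)
print("class separability (between/within ratio):", between / within)
```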


Author(s): Sangeeta Lal, Neetu Sardana, Ashish Sureka

Log statements present in source code provide important information to software developers because they are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and the catch-block level. They answer several research questions related to statistical and content analysis, which reveals the presence of differentiating properties between logged and non-logged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-block logging prediction, which is found to be effective.
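A minimal sketch of the kind of catch-block logging predictor described here, with invented toy examples and a simple textual feature set that the chapter itself may not use:

```python
# Sketch only: predict whether a catch block should contain a log statement
# from bag-of-words features of the block text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

catch_blocks = [
    'catch (IOException e) { log.error("read failed", e); }',
    'catch (InterruptedException e) { Thread.currentThread().interrupt(); }',
    'catch (SQLException e) { logger.warn("query failed", e); throw e; }',
    'catch (Exception e) { /* ignored */ }',
]
is_logged = [1, 0, 1, 0]  # 1 = contains a log statement

model = make_pipeline(TfidfVectorizer(token_pattern=r"\w+"),
                      LogisticRegression())
model.fit(catch_blocks, is_logged)
print(model.predict(["catch (IOException e) { return null; }"]))
```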


2017, Vol. 27 (09n10), pp. 1529-1547
Author(s): Zadia Codabux, Kazi Zakia Sultana, Byron J. Williams

It is important to maintain software quality as a software system evolves, and managing code smells in source code contributes towards quality software. While metrics have been used to pinpoint code smells in source code, we present an empirical study on the correlation of code smells with class-level (micro pattern) and method-level (nano-pattern) traceable code patterns. This study explores the relationship between code smells and class-level and method-level structural code constructs. We extracted micro patterns at the class level and nano-patterns at the method level from three versions of Apache Tomcat, three versions of Apache CXF, and two J2EE web applications, namely PersonalBlog and Roller from Stanford SecuriBench, and then compared their distributions in code smell versus non-code-smell classes and methods. We found that the Immutable and Sink micro patterns are more frequent in classes having code smells than in non-code-smell classes in the applications we analyzed. On the other hand, the LocalReader and LocalWriter nano-patterns are more frequent in code smell methods than in non-code-smell methods. We conclude that code smells are correlated with both micro and nano-patterns.
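The reported frequency comparison can be framed as a contingency-table test. The counts below are invented for illustration, and the paper's actual statistical procedure may differ:

```python
# Sketch only: test whether a micro pattern (e.g. Immutable) occurs more often
# in code-smell classes than in non-smell classes.
from scipy.stats import fisher_exact

#                 pattern present, pattern absent
table = [[45, 55],   # classes with a code smell
         [20, 80]]   # classes without a code smell

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```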


1988, Vol. 115, pp. 351-386
Author(s): Barry Naughton

Between 1964 and 1971 China carried out a massive programme of investment in the remote regions of south-western and western China. This development programme – called “the Third Front” – envisaged the creation of a huge self-sufficient industrial base area to serve as a strategic reserve in the event of China being drawn into war. Reflecting its primarily military orientation, the programme was considered top secret for many years; recent Chinese articles have discussed the huge costs and legacy of problems associated with the programme, but these discussions have been oblique and anecdotal, and no systematic appraisal has ever been published. Since Chinese analysts have avoided discussion of the Third Front, western accounts of China's development have also given it inadequate emphasis, and it has not been incorporated into our understanding of China during the 1960s and 1970s. It is common to assume that the “Cultural Revolution decade” was dominated by domestic political conflict, and characterized by an economic system made dysfunctional by excessive politicization, fragmented control, and an emphasis on small-scale locally self-sufficient development. The Third Front, however, was a purposive, large-scale, centrally-directed programme of development carried out in response to a perceived external threat with the broad support of China's national leaders. Moreover, this programme was immensely costly, having a negative impact on China's economic development that was certainly more far-reaching than the disruption of the Cultural Revolution.


2020
Author(s): Cemal Erdem, Ethan M. Bensman, Arnab Mutsuddy, Michael M. Saint-Antoine, Mehdi Bouhaddou, ...

The current era of big biomedical data accumulation and availability brings opportunities for integrating data to leverage its totality for new discoveries and/or clinically predictive models. Black-box statistical and machine learning methods are powerful for such integration but often cannot provide mechanistic reasoning, particularly at the single-cell level. While single-cell mechanistic models clearly enable such reasoning, they are predominantly “small-scale” and struggle with the scalability and reusability required for meaningful data integration. Here, we present an open-source pipeline for scalable, single-cell mechanistic modeling from simple, annotated input files that can serve as a foundation for mechanistic data integration. As a test case, we convert one of the largest existing single-cell mechanistic models to this format, demonstrating the robustness and reproducibility of the approach. We show that the model's cell line context can be changed by simple replacement of input file parameter values. We then use this new model to test alternative mechanistic hypotheses for the experimental observation that interferon-gamma (IFNG) inhibits epidermal growth factor (EGF)-induced cell proliferation. Model-based analysis suggested, and experiments support, that these observations are better explained by IFNG-induced SOCS1 expression sequestering activated EGF receptors, thereby downregulating AKT activity, rather than by direct IFNG-induced upregulation of p21 expression. Overall, this new pipeline enables large-scale, single-cell, and mechanistically transparent modeling as a data integration modality complementary to machine learning.
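A toy, hedged illustration of the input-file-driven pattern described above: a minimal "mechanistic" ODE model whose cell-line context changes purely by swapping parameter values loaded from a simple annotated table. This is not the authors' pipeline, only a sketch of the pattern it describes:

```python
# Sketch only: one-species synthesis/degradation model; each "context" is a
# parameter set that would come from an annotated input file.
from scipy.integrate import solve_ivp

contexts = {
    "cell_line_A": {"k_synth": 1.0, "k_deg": 0.5},
    "cell_line_B": {"k_synth": 1.0, "k_deg": 2.0},
}

def model(t, y, p):
    # dy/dt = synthesis - degradation
    return p["k_synth"] - p["k_deg"] * y

for name, p in contexts.items():
    sol = solve_ivp(model, (0, 10), [0.0], args=(p,))
    print(name, "final level ~", round(float(sol.y[0, -1]), 3))
```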


2019, Vol. 1, pp. 1-2
Author(s): Izabela Karsznia, Karolina Sielicka

The decision about removing or maintaining an object while changing the level of detail requires taking into account many features of the object itself and of its surroundings. Automatic generalization is the optimal way to obtain maps at various scales based on a single spatial database that stores up-to-date information with a high level of spatial accuracy. Researchers agree on the need for fully automating the generalization process (Stoter et al., 2016). Numerous research centres, cartographic agencies, and commercial companies have made successful attempts at implementing certain generalization solutions (Stoter et al., 2009, 2014, 2016; Regnauld, 2015; Burghardt et al., 2008; Chaundhry and Mackaness, 2008). Nevertheless, an effective and consistent methodology for generalizing small-scale maps has not gained enough attention so far, as most of the conducted research has focused on the acquisition of large-scale maps (Stoter et al., 2016). The presented research aims to fill this gap by exploring new variables that are of key importance in the automatic settlement selection process at small scales. Addressing this issue is an essential step towards new algorithms for effective and automatic settlement selection that will enrich the sparsely filled small-scale generalization toolbox.

The main idea behind this research is to use machine learning (ML) to explore new variables that can be important in automatic settlement generalization at small scales. To automate the generalization process, cartographic knowledge has to be collected and formalized. So far, a few approaches based on the use of ML have been proposed. One of the first attempts to determine generalization parameters with the use of ML was made by Weibel et al. (1995); the learning material was the observation of cartographers' manual work. Mustière (1998) also tried to identify the optimal sequence of generalization operators for roads using ML. A different approach was presented by Sester (2000), whose goal was to extract cartographic knowledge from spatial data characteristics, especially from the attributes and geometric properties of objects and the regularities and repetitive patterns that govern object selection, with the use of decision trees. Lagrange et al. (2000) and Balboa and López (2008) also used ML techniques, namely neural networks, to generalize line objects. Recently, Sester et al. (2018) proposed the application of deep learning to the task of building generalization. As noticed by Sester et al. (2018), these ideas, although interesting, remained proofs of concept only; moreover, they concerned topographic databases and large-scale maps. Promising results for automatic settlement selection at small scales were reported by Karsznia and Weibel (2018). To improve the settlement selection process, they used data enrichment and ML: thanks to classification models based on decision trees, they explored new variables that are decisive in the settlement selection process. However, they also concluded that there is probably still more “deep knowledge” to be discovered, possibly linked to further variables that were not included in their research. The motivation for this research is thus to fill this gap and look for additional, essential variables governing settlement selection at small scales.
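A hedged sketch of the decision-tree approach referenced above (Sester, 2000; Karsznia and Weibel, 2018): learn settlement selection from enriched variables, then inspect which variables drive the decision. The feature names, values, and selection rule are illustrative, not the study's dataset:

```python
# Sketch only: decision tree for settlement selection from enriched variables.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["population", "is_admin_centre", "road_junctions", "poi_count"]
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(100, 100_000, 300),  # population
    rng.integers(0, 2, 300),          # administrative centre flag
    rng.integers(0, 8, 300),          # nearby road junctions
    rng.integers(0, 50, 300),         # points of interest
])
y = (X[:, 0] > 20_000) | (X[:, 1] == 1)  # stand-in selection rule

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))  # which variables matter
```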

