A Hybrid Approach to Identify Code Smell Using Machine Learning Algorithms

Code smell detection aims to identify bugs and design problems introduced during software development. The main causes of code smell are code complexity, violation of programming rules, poor modelling, and a lack of unit-level testing by the developer. Different open source systems such as JEdit, Eclipse, and ArgoUML are evaluated in this work. After collecting the data, the best features are selected using recursive feature elimination (RFE). In this paper, the authors use different anomaly detection algorithms for efficient recognition of dirty code. The average accuracy values of k-means, GMM, autoencoder, PCA, and Bayesian networks are 98%, 94%, 96%, 89%, and 93%, respectively. The k-means clustering algorithm is the most suitable algorithm for code smell detection. Experimentally, the authors show that the ArgoUML project yields better performance than the Eclipse and JEdit projects.
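
The use of k-means as an anomaly detector can be illustrated with a minimal sketch. It is not the authors' pipeline: the metric pairs, the naive initialisation from the first k points, and the idea of scoring a class by its distance to the nearest centroid are all assumptions made for illustration, and RFE is left out. In practice the metrics would also be scaled first.

```python
import math


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    # Assumption: naive initialisation from the first k points.
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: group every point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical (lines_of_code, cyclomatic_complexity) pairs for clean classes.
clean = [(100, 4), (400, 15), (110, 5), (420, 16), (90, 3), (380, 14)]
centroids = kmeans(clean, k=2)

typical_score = anomaly_score((105, 4), centroids)
smelly_score = anomaly_score((900, 48), centroids)
```

A class would then be flagged when its score exceeds a threshold chosen on validation data; the threshold itself is not something the abstract specifies.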

PredNTS: Improved and Robust Prediction of Nitrotyrosine Sites by Integrating Multiple Sequence Features

International Journal of Molecular Sciences ◽

10.3390/ijms22052704 ◽

2021 ◽

Vol 22 (5) ◽

pp. 2704

Author(s):

Andi Nur Nilamyani ◽

Firda Nurul Auliah ◽

Mohammad Ali Moni ◽

Watshara Shoombuatong ◽

Md Mehedi Hasan ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Web Application ◽

Computational Prediction ◽

Vital Role ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Post Translational Modification ◽

Multiple Sequence ◽

Sequence Features

Nitrotyrosine, which is generated by numerous reactive nitrogen species, is a type of protein post-translational modification. Identification of site-specific nitration modification on tyrosine is a prerequisite to understanding the molecular function of nitrated proteins. Thanks to the progress of machine learning, computational prediction can play a vital role ahead of biological experimentation. Herein, we developed a computational predictor, PredNTS, by integrating multiple sequence features including K-mer, composition of k-spaced amino acid pairs (CKSAAP), AAindex, and binary encoding schemes. The important features were selected by the recursive feature elimination approach using a random forest classifier. Finally, we linearly combined the successive random forest (RF) probability scores generated by the different, single encoding-employing RF models. The resultant PredNTS predictor achieved an area under the curve (AUC) of 0.910 using five-fold cross-validation. It outperformed the existing predictors on a comprehensive and independent dataset. Furthermore, we investigated several machine learning algorithms to demonstrate the superiority of the employed RF algorithm. PredNTS is a useful computational resource for the prediction of nitrotyrosine sites. The web application with the curated datasets of PredNTS is publicly available.
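
The linear combination of per-encoding probability scores and the AUC metric can be sketched as follows. The weights, the example scores, and the function names are hypothetical; the abstract does not state the weights actually used, so equal weights are assumed here.

```python
def combine_scores(scores_by_encoding, weights):
    """Weighted linear combination of per-encoding probability scores."""
    total = sum(weights.values())
    weighted = sum(weights[name] * s for name, s in scores_by_encoding.items())
    return weighted / total


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical RF probabilities for one candidate tyrosine site.
site_scores = {"kmer": 0.80, "cksaap": 0.60, "aaindex": 0.70, "binary": 0.50}
# Hypothetical equal weights, one per encoding.
weights = {"kmer": 1.0, "cksaap": 1.0, "aaindex": 1.0, "binary": 1.0}
combined = combine_scores(site_scores, weights)

# Toy evaluation set: two positives and two negatives.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUC shown here is quadratic in the number of samples, which is fine for a toy set but would normally be replaced by a rank-based computation on real data.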

A new method for unveiling open clusters in Gaia

Astronomy and Astrophysics ◽

10.1051/0004-6361/201833390 ◽

2018 ◽

Vol 618 ◽

pp. A59 ◽

Cited By ~ 36

Author(s):

A. Castro-Ginard ◽

C. Jordi ◽

X. Luri ◽

F. Julbe ◽

M. Morvan ◽

...

Keyword(s):

Clustering Algorithm ◽

Dimensional Space ◽

Open Clusters ◽

Machine Learning Algorithms ◽

New Era ◽

Use Of Data ◽

Manual Intervention ◽

Density Based Clustering ◽

Artificial Neural Network Ann ◽

Astrometric Data

Context. The publication of the Gaia Data Release 2 (Gaia DR2) opens a new era in astronomy. It includes precise astrometric data (positions, proper motions, and parallaxes) for more than 1.3 billion sources, mostly stars. To analyse such a vast amount of new data, the use of data-mining techniques and machine-learning algorithms is mandatory. Aims. A great example of the application of such techniques and algorithms is the search for open clusters (OCs), groups of stars that were born and move together, located in the disc. Our aim is to develop a method to automatically explore the data space, requiring minimal manual intervention. Methods. We explore the performance of a density-based clustering algorithm, DBSCAN, to find clusters in the data together with a supervised learning method such as an artificial neural network (ANN) to automatically distinguish between real OCs and statistical clusters. Results. The development and implementation of this method in a five-dimensional space (l, b, ϖ, μα*, μδ) with the Tycho-Gaia Astrometric Solution (TGAS) data, and a posterior validation using Gaia DR2 data, lead to the proposal of a set of new nearby OCs. Conclusions. We have developed a method to find OCs in astrometric data, designed to be applied to the full Gaia DR2 archive.
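
The density-based clustering step can be sketched with a small pure-Python DBSCAN. This is an illustrative implementation under simplifying assumptions, not the authors' code: it works on two made-up dimensions instead of the five astrometric ones, uses a brute-force neighbour search, and omits the neural network that separates real OCs from statistical clusters.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force search; a point counts as its own neighbour.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Hypothetical 2-D stand-ins for astrometric coordinates.
stars = [
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),  # dense group
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),              # second dense group
    (9.0, 0.0),                                      # isolated field star
]
labels = dbscan(stars, eps=0.2, min_pts=3)
```

The values of `eps` and `min_pts` here are arbitrary; choosing them for real data is the part that a method aiming at minimal manual intervention has to automate.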

Development and Research of the Hybrid Approach to the Solution of Optimization Design Problems

Advances in Intelligent Systems and Computing - Proceedings of the Third International Scientific Conference “Intelligent Information Technologies for Industry” (IITI’18) ◽

10.1007/978-3-030-01821-4_26 ◽

2018 ◽

pp. 246-257

Author(s):

Leonid A. Gladkov ◽

Nadezhda V. Gladkova ◽

Sergey N. Leiba ◽

Nikolay E. Strakhov

Keyword(s):

Optimization Design ◽

Hybrid Approach ◽

Design Problems

Scalable hierarchical clustering by composition rank vector encoding and tree structure

10.1101/2020.04.12.038026 ◽

2020 ◽

Author(s):

Xiao Lai ◽

Pu Tian

Keyword(s):

Machine Learning ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Machine Learning Algorithms ◽

Tree Structure ◽

Supervised Machine Learning ◽

High Dimensional ◽

Rank Vector ◽

Nonlinear Correlations

Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, the development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering high dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility by clustering protein structural domains. No record comparison, which is an expensive and essential step common to all present clustering algorithms, is involved. Consequently, it achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.

A Hybrid Deep Clustering Approach for Robust Cell Type Profiling Using Single-cell RNA-seq Data

10.1101/511626 ◽

2019 ◽

Cited By ~ 2

Author(s):

Suhas Srinivasan ◽

Nathan T. Johnson ◽

Dmitry Korkin

Keyword(s):

Deep Learning ◽

Single Cell ◽

Clustering Algorithm ◽

Hybrid Approach ◽

Feature Learning ◽

Specific Cell ◽

Clustering Methods ◽

Model Based Clustering ◽

Clustering And Classification ◽

Living Organisms

Single-cell RNA sequencing (scRNA-seq) is a recent technology that enables fine-grained discovery of cellular subtypes and specific cell states. It routinely uses machine learning methods, such as feature learning, clustering, and classification, to assist in uncovering novel information from scRNA-seq data. However, current methods are not well suited to deal with the substantial amount of noise that is created by the experiments or the variation that occurs due to differences in the cells of the same type. Here, we develop a new hybrid approach, Deep Unsupervised Single-cell Clustering (DUSC), that integrates feature generation based on a deep learning architecture with a model-based clustering algorithm, to find a compact and informative representation of the single-cell transcriptomic data generating robust clusters. We also include a technique to estimate an efficient number of latent features in the deep learning model. Our method outperforms both classical and state-of-the-art feature learning and clustering methods, approaching the accuracy of supervised learning. The method is freely available to the community and will hopefully facilitate our understanding of the cellular atlas of living organisms as well as provide the means to improve patient diagnostics and treatment.

Comparing within- and cross-project machine learning algorithms for code smell detection

10.1145/3472674.3473978 ◽

2021 ◽

Author(s):

Manuel De Stefano ◽

Fabiano Pecorelli ◽

Fabio Palomba ◽

Andrea De Lucia

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Code Smell ◽

Cross Project

Combining Clustering and Factor Analysis as Complementary Techniques

International Journal of Data Analytics ◽

10.4018/ijda.2020070104 ◽

2020 ◽

Vol 1 (2) ◽

pp. 48-57

Author(s):

Lakshmi Prayaga ◽

Krishna Devulapalli ◽

Chandra Prayaga

Keyword(s):

Factor Analysis ◽

Traffic Management ◽

Clustering Algorithm ◽

Perfect Match ◽

Machine Learning Algorithms ◽

Insurance Companies ◽

Data Set ◽

Traffic Violations ◽

Traffic Violation ◽

Complementary Techniques

The study of driver behavior and associated accidents has been of interest to researchers and insurance companies. From the perspective of insurance companies, identifying factors that contribute to traffic violations plays a significant role in providing insurance quotes as it establishes the basis for charging appropriate insurance rates to customers. This study assesses the traffic violation intensity for 64 counties in the state of Florida, USA by using the publicly available traffic violations data set. This data set consists of 3,669,796 records with 11 attributes, which include race, gender, driver's age, type of driving violation, etc. The 187 types of traffic violations are categorized into 11 broad traffic violation categories. Two machine learning algorithms, factor analysis and k-means clustering, were applied in this study. After applying factor analysis, a new comprehensive traffic violation index (TVI) was developed, which quantified the traffic violation intensity of each county. All the counties in the data set were ranked with the TVI scores, and the counties with high TVI scores were identified. The k-means clustering algorithm was then applied to the same data, and four clusters of counties were derived. The counties that were grouped in each cluster were compared with the TVI scores to check if the counties in each cluster had similar TVI scores. The counties with the highest TVI scores are found to be grouped in one cluster, followed by counties with the next highest TVI scores in the second cluster, and so on. Thus, it is observed that there is a perfect match in the results of both models. They serve as two techniques complementary to each other, in that the k-means clustering method groups counties with comparable traffic violation intensities and factor analysis is able to also rank individual counties according to the TVI. These techniques have identified the counties with high traffic violation intensities, which helps policymakers take adequate measures for traffic management.

Towards Near-Real-Time Intrusion Detection for IoT Devices using Supervised Learning and Apache Spark

Electronics ◽

10.3390/electronics9030444 ◽

2020 ◽

Vol 9 (3) ◽

pp. 444 ◽

Cited By ~ 1

Author(s):

Valerio Morfino ◽

Salvatore Rampone

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithms ◽

Hybrid Approach ◽

Cyber Attacks ◽

Machine Learning Algorithms ◽

Apache Spark ◽

Identification Accuracy ◽

Supervised Machine Learning ◽

Iot Devices

In the field of Internet of Things (IoT) infrastructures, attack and anomaly detection are rising concerns. With the increased use of IoT infrastructure in every domain, threats and attacks in these infrastructures are also growing proportionally. In this paper, the performance of several machine learning algorithms in identifying cyber-attacks (namely SYN-DOS attacks) on IoT systems is compared, both in terms of application performance and in training/application times. We use supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. We show the implementation details and the performance of those algorithms on public datasets using a training set of up to 2 million instances. We adopt a Cloud environment, emphasizing the importance of the scalability and of the elasticity of use. Results show that all the Spark algorithms used achieve very good identification accuracy (>99%). Overall, one of them, Random Forest, achieves an accuracy of 1. We also report a very short training time (23.22 sec for Decision Tree with 2 million rows). The experiments also show a very low application time (0.13 sec for more than 600,000 instances for Random Forest) using Apache Spark in the Cloud. Furthermore, the explicit model generated by Random Forest is very easy to implement using high- or low-level programming languages. In light of the results obtained, both in terms of computation times and identification performance, a hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second level analysis (training) performed in the Cloud.
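
What an "explicit" Random Forest model might look like once exported to a device can be sketched as plain conditionals with a majority vote. Everything below is hypothetical: the feature names, the thresholds, and the three hand-written trees are illustrative stand-ins, not rules learned in the paper.

```python
from collections import Counter


# Each function stands in for one exported decision tree.
# Feature names and thresholds are hypothetical.
def tree_a(flow):
    return "attack" if flow["syn_per_sec"] > 100 else "normal"


def tree_b(flow):
    if flow["ack_ratio"] < 0.2:
        return "attack" if flow["syn_per_sec"] > 50 else "normal"
    return "normal"


def tree_c(flow):
    suspicious = flow["distinct_ports"] > 30 and flow["ack_ratio"] < 0.5
    return "attack" if suspicious else "normal"


FOREST = (tree_a, tree_b, tree_c)


def classify(flow):
    """Majority vote over the explicit trees (odd count, so no ties)."""
    votes = Counter(tree(flow) for tree in FOREST)
    return votes.most_common(1)[0][0]


flood = {"syn_per_sec": 400, "ack_ratio": 0.05, "distinct_ports": 3}
benign = {"syn_per_sec": 5, "ack_ratio": 0.9, "distinct_ports": 2}
flood_label = classify(flood)
benign_label = classify(benign)
```

Because the model is only comparisons and a vote, it needs no machine learning runtime on the device, which is what makes the split between on-device classification and Cloud-side training plausible.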

Data Mining Itemset of Big Data Using Pre-Processing Based on Mapreduce FrameWork with ETL Tools

APTIKOM Journal on Computer Science and Information Technologies ◽

10.11591/aptikom.j.csit.103 ◽

2017 ◽

Vol 2 (2) ◽

pp. 57-62

Author(s):

Padmanathan Anantharaman ◽

H.V. Ramakrishan

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

Programming Model ◽

Hybrid Approach ◽

Processing Technique ◽

Frequent Itemsets ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining ◽

Dataset Size

As data volumes continue to grow, they quickly consume the capacity of data warehouses and application databases, and IT organizations are forced into costly upgrades to expensive databases and data warehouse hardware appliances. An enormous amount of data is also being explored through the Internet of Things (IoT) as technologies advance and people use them in day-to-day activities; this data is termed Big Data, with its own characteristics and challenges. Frequent itemset mining algorithms aim to discover frequent itemsets in transactional databases, but as the dataset size increases, it can no longer be handled by traditional frequent itemset mining. The MapReduce programming model solves the problem of large datasets, but it has a large communication cost, which reduces execution efficiency. This work proposes a new pre-processing technique that applies k-means to the BigFIM algorithm. ClustBigFIM uses a hybrid approach: clustering with the k-means algorithm to generate clusters from huge datasets, and Apriori and Eclat to mine frequent itemsets from the generated clusters using the MapReduce programming model. Results show that the execution efficiency of the ClustBigFIM algorithm is increased by applying the k-means clustering algorithm before the BigFIM algorithm as a pre-processing technique.
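
The frequent itemset mining step can be illustrated with a compact single-machine Apriori sketch. It is not ClustBigFIM: there is no k-means pre-clustering, no Eclat, and no MapReduce here, and the transactions and support threshold are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: union pairs of frequent itemsets into larger candidates.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
freq = apriori(baskets, min_support=2)
```

In a clustered variant, a function like this would be run separately on the transactions of each cluster, which is where the pre-processing is expected to cut the candidate space.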

AUTOMATED DIAGNOSIS OF BRAIN TUMOURS ASTROCYTOMAS USING PROBABILISTIC NEURAL NETWORK CLUSTERING AND SUPPORT VECTOR MACHINES

International Journal of Neural Systems ◽

10.1142/s0129065705000013 ◽

2005 ◽

Vol 15 (01n02) ◽

pp. 1-11 ◽

Cited By ~ 26

Author(s):

DIMITRIS GLOTSOS ◽

JUSSI TOHKA ◽

PANAGIOTA RAVAZOULA ◽

DIONISIS CAVOURAS ◽

GEORGE NIKIFORIDIS

Keyword(s):

Neural Network ◽

Clustering Algorithm ◽

Probabilistic Neural Network ◽

Support Vector ◽

Network Clustering ◽

High Grade ◽

Cell Nuclei ◽

Decision Tree Classification ◽

Average Accuracy ◽

Malignancy Grading

A computer-aided diagnosis system was developed to assist malignancy grading of brain astrocytomas. Microscopy images from 140 astrocytic biopsies were digitized and cell nuclei were automatically segmented using a Probabilistic Neural Network pixel-based clustering algorithm. A decision tree classification scheme was constructed to discriminate low, intermediate and high-grade tumours by analyzing nuclear features extracted from segmented nuclei with a Support Vector Machine classifier. Nuclei were segmented with an average accuracy of 86.5%. Low, intermediate, and high-grade tumours were identified with 95%, 88.3%, and 91% accuracies respectively. The proposed algorithm could be used as a second opinion tool for histopathologists.
