Computation of Program Source Code Similarity by Composition of Parse Tree and Call Graph

2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Hyun-Je Song ◽  
Seong-Bae Park ◽  
Se Young Park

This paper proposes a novel method to compute how similar two program source codes are. Since a program source code is represented in a structural form, the proposed method adopts convolution kernel functions as a similarity measure. A program source code carries two kinds of structural information: one is syntactic information and the other is the dependencies among the function calls within the program. Since the syntactic information of a program is expressed by its parse tree, the syntactic similarity between two programs is computed by a parse tree kernel. The function calls within a program provide a global structure of the program and can be represented as a graph. Therefore, the similarity of function calls is computed with a graph kernel. Both structural similarities are then reflected simultaneously in comparing program source codes by composing the parse tree and graph kernels based on cyclomatic complexity. According to the experimental results on a real data set for program plagiarism detection, the proposed method proves effective in capturing the similarity between programs. The experiments show that plagiarized pairs of programs are found correctly and thoroughly by the proposed method.
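
As a rough illustration of the idea of composing the two kernels (not the authors' exact formulation), the following Python sketch blends a toy parse-tree kernel and a toy call-graph kernel, with the mixing weight driven by the programs' cyclomatic complexities; the function names and the weighting scheme are assumptions.

```python
# Minimal sketch (not the authors' exact formulation): combining a parse-tree
# kernel and a call-graph kernel into one similarity score, with the mixing
# weight driven by cyclomatic complexity. The toy kernels below only count
# shared node labels / call edges; real convolution kernels are richer.
from collections import Counter

def parse_tree_kernel(tree_a, tree_b):
    """Toy stand-in for a parse-tree kernel: count shared node labels."""
    ca, cb = Counter(tree_a), Counter(tree_b)
    return sum((ca & cb).values())

def call_graph_kernel(graph_a, graph_b):
    """Toy stand-in for a graph kernel: count shared caller->callee edges."""
    return len(set(graph_a) & set(graph_b))

def composed_similarity(tree_a, tree_b, graph_a, graph_b, cc_a, cc_b):
    """Blend the two kernels; higher cyclomatic complexity (cc) shifts
    weight toward the call-graph structure (an assumed heuristic)."""
    alpha = 1.0 / (1.0 + 0.5 * (cc_a + cc_b))   # assumed weighting scheme
    k_tree = parse_tree_kernel(tree_a, tree_b)
    k_graph = call_graph_kernel(graph_a, graph_b)
    return alpha * k_tree + (1.0 - alpha) * k_graph

# Example usage with hypothetical node labels, call edges, and complexities.
sim = composed_similarity(
    ["FuncDef", "For", "Assign"], ["FuncDef", "For", "Return"],
    [("main", "sort")], [("main", "sort"), ("sort", "swap")],
    cc_a=3, cc_b=4)
print(sim)
```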

Computers ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 47
Author(s):  
Fariha Iffath ◽  
A. S. M. Kayes ◽  
Md. Tahsin Rahman ◽  
Jannatul Ferdows ◽  
Mohammad Shamsul Arefin ◽  
...  

A programming contest generally involves the host presenting a set of logical and mathematical problems to the contestants. The contestants are required to write computer programs that are capable of solving these problems. An online judge system is used to automate the judging of the programs submitted by the users; online judges are systems designed for the reliable evaluation of submitted source codes. Traditional online judging platforms are not ideally suited to programming labs, as they do not support partial scoring or efficient detection of plagiarized code. Considering this, in this paper we present an online judging framework that is capable of automatically scoring codes by efficiently detecting plagiarized content and the level of accuracy of the codes. Our system detects plagiarism by extracting fingerprints of programs and comparing the fingerprints instead of the whole files. We used winnowing to select fingerprints among the k-gram hash values of a source code, which were generated by the Rabin–Karp algorithm. The proposed system is compared with existing online judging platforms to show its superiority in terms of time efficiency, correctness, and feature availability. In addition, we evaluated our system on large data sets and compared its run time with that of MOSS, a widely used plagiarism detection technique.
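
A minimal sketch of the fingerprinting step described above, assuming illustrative parameter values (k-gram length, window size, hash base and modulus) rather than the ones used by the actual system: Rabin–Karp rolling hashes over k-grams, winnowing to pick one fingerprint per window, and a Jaccard overlap of the fingerprint sets.

```python
# Sketch of winnowing over Rabin-Karp k-gram hashes; parameters are assumed.
def kgram_hashes(text, k=5, base=257, mod=(1 << 31) - 1):
    """Rolling Rabin-Karp hashes of every k-gram in `text`."""
    if len(text) < k:
        return []
    h, high = 0, pow(base, k - 1, mod)
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(1, len(text) - k + 1):
        h = ((h - ord(text[i - 1]) * high) * base + ord(text[i + k - 1])) % mod
        hashes.append(h)
    return hashes

def winnow(hashes, window=4):
    """Keep the rightmost minimum hash of each window as a fingerprint."""
    fingerprints = set()
    for i in range(len(hashes) - window + 1):
        win = hashes[i:i + window]
        j = window - 1 - win[::-1].index(min(win))  # rightmost minimum
        fingerprints.add((i + j, win[j]))
    return fingerprints

def similarity(code_a, code_b):
    """Jaccard overlap of the two programs' fingerprint hash sets."""
    fa = {h for _, h in winnow(kgram_hashes(code_a))}
    fb = {h for _, h in winnow(kgram_hashes(code_b))}
    return len(fa & fb) / max(1, len(fa | fb))

print(similarity("for(i=0;i<n;i++) s+=a[i];", "for(j=0;j<n;j++) sum+=a[j];"))
```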


Plagiarism is the unauthorized copying or stealing of the intellectual property of others. Two main approaches are used to counter this problem: external plagiarism detection and intrinsic plagiarism detection. External algorithms compare a suspicious file with numerous sources, whereas intrinsic algorithms inspect only the suspicious file in order to predict plagiarism. In this work, the area chosen for detecting plagiarism is programs, or source code files. In the case of source code, the theft consists of copying the entire source code, or the logic used in a particular program, without permission or copyright. There exist many ways to detect plagiarism in source code files. Performing plagiarism checking on a large dataset has a very high computational cost and is time consuming. To achieve computationally efficient similarity detection in source code files, the Hadoop framework is used, which enables parallel computation over large datasets. However, the raw data available are not in a form suitable for the existing plagiarism checking tools, since their size is too large and they have the characteristics of big data. Thus a qualifying model is required for the dataset to be fed into Hadoop so that it can be processed efficiently to check for plagiarism in source code. To generate such a model, machine learning is used, thereby combining big data with machine learning.
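
One common way to parallelize fingerprint-based candidate-pair generation on Hadoop is the Streaming pattern sketched below; it is an assumed illustration, not the qualifying model or pipeline described above, and the input format and the `fingerprints` helper are hypothetical.

```python
# Assumed sketch: Hadoop Streaming mapper/reducer for candidate-pair
# generation. The mapper emits (fingerprint, file_id) pairs; the reducer,
# which receives lines sorted by fingerprint, emits every pair of files
# sharing a fingerprint. Input lines are assumed to look like
# "<file_id>\t<source code on one line>".
import sys
from itertools import combinations, groupby

def fingerprints(code, k=5):
    """Placeholder fingerprinting: hashes of k-grams (no winnowing here)."""
    return {hash(code[i:i + k]) for i in range(max(0, len(code) - k + 1))}

def mapper(stdin=sys.stdin):
    for line in stdin:
        file_id, code = line.rstrip("\n").split("\t", 1)
        for fp in fingerprints(code):
            print(f"{fp}\t{file_id}")

def reducer(stdin=sys.stdin):
    keyed = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for fp, group in groupby(keyed, key=lambda kv: kv[0]):
        files = sorted({file_id for _, file_id in group})
        for a, b in combinations(files, 2):
            print(f"{a}\t{b}\t1")   # a downstream job sums shared fingerprints

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```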


Geophysics ◽  
2021 ◽  
pp. 1-67
Author(s):  
Hossein Jodeiri Akbari Fam ◽  
Mostafa Naghizadeh ◽  
Oz Yilmaz

Two-dimensional seismic surveys are often conducted along crooked-line traverses due to the inaccessibility of rugged terrains, logistical and environmental restrictions, and budget limitations. The crookedness of the line traverses, irregular topography, and complex subsurface geology with steeply dipping and curved interfaces can adversely affect the signal-to-noise ratio of the data. The crooked-line geometry violates the straight-line survey assumption that is a basic principle behind the 2D multifocusing (MF) method and leads to crossline spread of midpoints. Additionally, the crooked-line geometry can give rise to potential pitfalls and artifacts, thus leading to difficulties in imaging and velocity-depth model estimation. We develop a novel multifocusing algorithm for crooked-line seismic data and revise the traveltime equation accordingly to achieve better signal alignment before stacking. Specifically, we present a 2.5D multifocusing reflection traveltime equation that explicitly takes into account the midpoint dispersion and cross-dip effects. The new formulation corrects for normal, inline, and crossline dip moveouts simultaneously, which is significantly more accurate than removing these effects sequentially; applying NMO, DMO, and CDMO separately tends to produce significant errors, especially at large offsets. The 2.5D multifocusing method can be applied automatically with a coherence-based global optimization search on the data. We investigated the accuracy of the new formulation by testing it on different synthetic models and a real seismic data set. Applying the proposed approach to the real data led to a high-resolution seismic image with a significant quality improvement compared to the conventional method. Numerical tests show that the new formula can accurately focus the primary reflections at their correct locations, remove anomalous dip-dependent velocities, and extract true dips from seismic data for structural interpretation. The proposed method efficiently projects and extracts valuable 3D structural information when applied to crooked-line seismic surveys.
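
The paper's 2.5D multifocusing traveltime equation is not reproduced here; for orientation, the classical straight-line building blocks it generalizes are the hyperbolic NMO moveout and Levin's dip-corrected form, which a crooked-line formulation must further extend with inline and crossline (cross-dip) terms:

```latex
% Classical straight-line moveout relations (not the paper's 2.5D formula):
% hyperbolic NMO and the dip-corrected (Levin) form.
t^{2}(x) = t_{0}^{2} + \frac{x^{2}}{v_{\mathrm{NMO}}^{2}},
\qquad
t^{2}(x) = t_{0}^{2} + \frac{x^{2}\cos^{2}\theta}{v^{2}},
```

where t_0 is the zero-offset two-way time, x the source-receiver offset, v_NMO the NMO velocity, v the medium velocity, and theta the reflector dip.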


2020 ◽  
Vol 12 (11) ◽  
pp. 197
Author(s):  
Giuseppe Antonio Pierro ◽  
Roberto Tonelli ◽  
Michele Marchesi

Many empirical software engineering studies show that there is a need for repositories where source codes are acquired, filtered, and classified. During the last few years, Ethereum block explorer services have emerged as popular projects for exploring and searching Ethereum blockchain data such as transactions, addresses, tokens, smart contracts’ source codes, prices, and other activities taking place on the Ethereum blockchain. Despite the availability of this kind of service, retrieving specific information useful to empirical software engineering studies, such as the study of smart contracts’ software metrics, might require many subtasks, such as searching for specific transactions in a block, parsing files in HTML format, and filtering the smart contracts to remove duplicated code or unused smart contracts. In this paper, we address this problem by creating Smart Corpus, a corpus of smart contracts in an organized, reasoned, and up-to-date repository where Solidity source code and other metadata about Ethereum smart contracts can be retrieved easily and systematically. We present Smart Corpus’s design and its initial implementation, and we show how the data set of smart contracts’ source codes in a variety of programming languages can be queried and processed to obtain useful information on smart contracts and their software metrics. Smart Corpus aims to create a smart-contract repository where smart-contract data (source code, application binary interface (ABI), and byte code) are freely and immediately available and are classified based on the main software metrics identified in the scientific literature. Smart contracts’ source codes have been validated by EtherScan, and each contract comes with its own associated software metrics as computed by the freely available software PASO. Moreover, Smart Corpus can easily be extended as the number of new smart contracts increases day by day.
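
As an illustration of the kind of data Smart Corpus stores (source code, ABI, byte code), the sketch below pulls a verified contract's metadata from the public Etherscan API; the endpoint and field names are Etherscan's (to be checked against its current documentation), not Smart Corpus's own interface, and the API key and contract address are placeholders.

```python
# Illustrative retrieval of verified-contract data via Etherscan's API;
# this is not the Smart Corpus interface. YOUR_API_KEY and the address
# below are placeholders.
import requests

API = "https://api.etherscan.io/api"

def fetch_contract(address, api_key="YOUR_API_KEY"):
    params = {"module": "contract", "action": "getsourcecode",
              "address": address, "apikey": api_key}
    item = requests.get(API, params=params, timeout=30).json()["result"][0]
    return {"name": item.get("ContractName"),
            "source": item.get("SourceCode"),
            "abi": item.get("ABI"),
            "compiler": item.get("CompilerVersion")}

if __name__ == "__main__":
    meta = fetch_contract("0x0000000000000000000000000000000000000000")
    print(meta["name"], len(meta["source"] or ""))
```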


2019 ◽  
Vol 1 (1) ◽  
pp. 46-56 ◽  
Author(s):  
Victor R. L. Shen

Students who major in computer science and/or engineering are required to write program code in a variety of programming languages. However, many students submit source codes that they obtain from the Internet or from friends with no or few modifications. Detecting code plagiarism by students is very time-consuming and leads to the problem of unfair evaluation of learning performance. This paper proposes a novel method to detect source code plagiarism by using a high-level fuzzy Petri net (HLFPN) based on the abstract syntax tree (AST). First, the AST of each source code is generated after the lexical and syntactic analyses have been done. Second, a token sequence is generated from the AST. Using the AST makes it possible to detect plagiarism even when identifiers are renamed or program statements are reordered. Finally, the generated token sequences are compared with one another using an HLFPN to determine whether code plagiarism has occurred. Furthermore, the experimental results indicate that the proposed method makes better determinations in detecting code plagiarism.
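
A minimal sketch of the AST-to-token-sequence idea, using Python's standard `ast` module and a plain sequence similarity in place of the HLFPN comparison; it illustrates why identifier renaming leaves the sequence unchanged, and is not the method of the paper.

```python
# AST-based token sequences that ignore identifier names, compared with
# difflib's SequenceMatcher as a stand-in for the HLFPN comparison step.
import ast
from difflib import SequenceMatcher

def token_sequence(source):
    """Walk the AST and record only node types, so renaming identifiers
    (ast.Name / ast.arg) leaves the sequence unchanged."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def similarity(src_a, src_b):
    return SequenceMatcher(None, token_sequence(src_a),
                           token_sequence(src_b)).ratio()

original    = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
plagiarized = "def acc(values):\n    r = 0\n    for v in values:\n        r += v\n    return r"
print(similarity(original, plagiarized))   # close to 1.0 despite renamed identifiers
```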


Author(s):  
Weiping Liu ◽  
John W. Sedat ◽  
David A. Agard

Any real-world object is three-dimensional. The principle of tomography, which reconstructs the 3-D structure of an object from its 2-D projections at different view angles, has found application in many disciplines. Electron microscopic (EM) tomography of non-ordered structures (e.g., subcellular structures in biology and non-crystalline structures in materials science) has been exercised sporadically over the last twenty years or so. As vital as the 3-D structural information is, and with no existing alternative 3-D imaging technique to compete in its high-resolution range, the technique to date remains the domain of a brave few. Its tedious tasks have prevented it from becoming a routine tool. One key to promoting its popularity is automation: data collection has been automated in our lab and can routinely yield a data set of over 100 projections in a matter of a few hours. Now the image processing part is also automated. Such automation makes the job easier, faster, and better.


2019 ◽  
Vol XVI (2) ◽  
pp. 1-11
Author(s):  
Farrukh Jamal ◽  
Hesham Mohammed Reyad ◽  
Soha Othman Ahmed ◽  
Muhammad Akbar Ali Shah ◽  
Emrah Altun

A new three-parameter continuous model called the exponentiated half-logistic Lomax distribution is introduced in this paper. Basic mathematical properties of the proposed model are investigated, including raw and incomplete moments, skewness, kurtosis, generating functions, Rényi entropy, Lorenz, Bonferroni, and Zenga curves, probability weighted moments, the stress-strength model, order statistics, and record statistics. The model parameters are estimated using the maximum likelihood criterion, and the behaviour of these estimates is examined by conducting a simulation study. The applicability of the new model is illustrated by applying it to a real data set.
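
The exponentiated half-logistic Lomax density is not reproduced here; as a hedged illustration of the maximum likelihood criterion and the simulation-based assessment described, the sketch below fits the plain Lomax (Pareto II) baseline with SciPy and summarizes the bias and MSE of the estimates over repeated samples.

```python
# Illustrative sketch only: fits the plain Lomax baseline by ML and examines
# the estimates over repeated samples, mirroring the simulation study in
# spirit (the EHL-Lomax model itself is not implemented here).
import numpy as np
from scipy import stats

def simulate_ml(c_true=2.5, scale_true=1.0, n=200, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        sample = stats.lomax.rvs(c_true, scale=scale_true, size=n,
                                 random_state=rng)
        c_hat, _, scale_hat = stats.lomax.fit(sample, floc=0)  # ML fit
        estimates.append((c_hat, scale_hat))
    est = np.array(estimates)
    truth = np.array([c_true, scale_true])
    bias = est.mean(axis=0) - truth
    mse = ((est - truth) ** 2).mean(axis=0)
    return bias, mse

bias, mse = simulate_ml()
print("bias (shape, scale):", bias, "MSE:", mse)
```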


Author(s):  
Parisa Torkaman

The generalized inverted exponential distribution is introduced as a lifetime model with good statistical properties. In this paper, the estimation of the probability density function and the cumulative distribution function of this distribution is considered using five different estimation methods: the uniformly minimum variance unbiased (UMVU), maximum likelihood (ML), least squares (LS), weighted least squares (WLS), and percentile (PC) estimators. The performance of these estimation procedures is compared by numerical simulations based on the mean squared error (MSE). The simulation studies show that the UMVU estimator performs better than the others, and that when the sample size is large enough, the ML and UMVU estimators are almost equivalent and more efficient than the LS, WLS, and PC estimators. Finally, results using a real data set are analyzed.
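
A small simulation sketch in the same spirit, assuming the common parameterization F(x) = 1 − (1 − e^(−λ/x))^α of the generalized inverted exponential distribution; it covers only the ML column of such a comparison, estimating the pdf at a fixed point and reporting the simulated MSE.

```python
# Sketch under an assumed parameterization of the generalized inverted
# exponential distribution, F(x) = 1 - (1 - exp(-lam/x))**alpha, x > 0.
# Estimates (alpha, lam) by maximum likelihood and reports the simulated
# MSE of the plug-in pdf estimate at a point (only the ML estimator).
import numpy as np
from scipy.optimize import minimize

def pdf(x, alpha, lam):
    return (alpha * lam / x**2) * np.exp(-lam / x) * (1 - np.exp(-lam / x))**(alpha - 1)

def sample(alpha, lam, n, rng):
    u = rng.uniform(size=n)
    return -lam / np.log1p(-(1 - u) ** (1 / alpha))   # inverse-CDF sampling

def neg_loglik(params, data):
    alpha, lam = params
    if alpha <= 0 or lam <= 0:
        return np.inf
    return -np.sum(np.log(pdf(data, alpha, lam)))

def mse_of_ml_pdf(alpha=2.0, lam=1.5, x0=1.0, n=100, reps=300, seed=1):
    rng = np.random.default_rng(seed)
    true_val, errs = pdf(x0, alpha, lam), []
    for _ in range(reps):
        data = sample(alpha, lam, n, rng)
        res = minimize(neg_loglik, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
        a_hat, l_hat = res.x
        errs.append((pdf(x0, a_hat, l_hat) - true_val) ** 2)
    return np.mean(errs)

print("simulated MSE of ML pdf estimate:", mse_of_ml_pdf())
```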


2019 ◽  
Vol 14 (2) ◽  
pp. 148-156
Author(s):  
Nighat Noureen ◽  
Sahar Fazal ◽  
Muhammad Abdul Qadir ◽  
Muhammad Tanvir Afzal

Background: Specific combinations of Histone Modifications (HMs) contributing to the histone code hypothesis lead to various biological functions. HM combinations have been utilized by various studies to divide the genome into different regions, which have been classified as chromatin states. Mostly Hidden Markov Model (HMM) based techniques have been utilized for this purpose. In chromatin studies, data from Next Generation Sequencing (NGS) platforms are used. Chromatin states based on histone modification combinatorics are annotated by mapping them to functional regions of the genome. The number of states predicted by the HMM tools has so far been justified biologically. Objective: The present study aims to provide a computational scheme to identify the underlying hidden states in the data under consideration. Methods: We propose a computational scheme, HCVS, based on hierarchical clustering and a visualization strategy to achieve the objective of the study. Results: We tested the proposed scheme on a real data set of nine cell types comprising nine chromatin marks. The approach successfully identified the state numbers for various possibilities. The results were also compared with one of the existing models and showed quite good correlation. Conclusion: The HCVS model not only helps in deciding the optimal state numbers for particular data but also justifies the results biologically, thereby correlating the computational and biological aspects.
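
The HCVS implementation itself is not reproduced here; the sketch below only illustrates the general idea of hierarchically clustering genomic bins by their binary histone-mark combinations with SciPy and reading candidate state numbers off the dendrogram at a few assumed cut heights, using random filler data.

```python
# Rough sketch of the general idea (not the HCVS implementation): cluster
# genomic bins by their binary histone-mark combinations, then read off
# candidate chromatin-state numbers at several dendrogram cut heights.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
bins_by_marks = rng.integers(0, 2, size=(1000, 9))       # 1000 bins x 9 marks (filler)

distances = pdist(bins_by_marks.astype(bool), metric="jaccard")  # mark-combination dissimilarity
tree = linkage(distances, method="average")

for cut in (0.3, 0.5, 0.7):                               # assumed cut heights
    labels = fcluster(tree, t=cut, criterion="distance")
    print(f"cut={cut}: {labels.max()} candidate states")
```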


2021 ◽  
Vol 13 (9) ◽  
pp. 1703
Author(s):  
He Yan ◽  
Chao Chen ◽  
Guodong Jin ◽  
Jindong Zhang ◽  
Xudong Wang ◽  
...  

The traditional method of constant false-alarm rate detection is based on the assumption of an echo statistical model; against the background of sea clutter and other interference, its target recognition accuracy is very low and its false-alarm rate is high. Therefore, computer vision technology is widely discussed as a way to improve the detection performance. However, the majority of studies have focused on synthetic aperture radar because of its high resolution, while for defense radar the detection performance is not satisfactory because of its low resolution. To this end, we herein propose a novel target detection method for coastal defense radar based on the faster region-based convolutional neural network (Faster R-CNN). The main processing steps are as follows: (1) Faster R-CNN is selected as the sea-surface target detector because of its high target detection accuracy; (2) a modified Faster R-CNN is employed based on the characteristics of sparsity and small target size in the data set; and (3) soft non-maximum suppression is exploited to eliminate possible overlapping detection boxes. Furthermore, detailed comparative experiments based on a real data set of coastal defense radar are performed. The mean average precision of the proposed method is improved by 10.86% compared with that of the original Faster R-CNN.
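
Step (3) names soft non-maximum suppression; the sketch below is a generic Gaussian soft-NMS in NumPy (the detector itself is not reproduced, and sigma and the score threshold are illustrative values), which down-weights rather than discards heavily overlapping boxes.

```python
# Generic Gaussian soft non-maximum suppression over [x1, y1, x2, y2] boxes.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    boxes, scores = boxes.astype(float), scores.astype(float).copy()
    keep, idx = [], np.arange(len(scores))
    while idx.size:
        best = idx[np.argmax(scores[idx])]
        keep.append(best)
        idx = idx[idx != best]
        if idx.size == 0:
            break
        overlaps = iou(boxes[best], boxes[idx])
        scores[idx] *= np.exp(-(overlaps ** 2) / sigma)   # Gaussian decay of scores
        idx = idx[scores[idx] > score_thresh]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]])
scores = np.array([0.95, 0.90, 0.80])
print(soft_nms(boxes, scores))   # the heavily overlapping box is down-weighted, not dropped
```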

