Plagiarism Detection in Programming Assignments using Machine Learning

Plagiarism in programming assignments has been increasing, which affects the evaluation of students. This paper proposes a machine learning approach for plagiarism detection in programming assignments. Different features related to source code are computed based on the similarity score of n-grams, code style similarity, and dead code. Then, an XGBoost model is trained to predict whether a pair of source codes is plagiarised or not. Many plagiarism detection techniques ignore dead code, such as unused variables and functions, in their prediction tasks, but the number of unused variables and functions in the source code is considered in this paper. Using our features, the model achieved an accuracy score of 94% and an average F1-score of 0.905 on the test set. We also compared the results of the XGBoost model with support vector machines (SVM) and report that the XGBoost model performed better on our dataset.
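
The two kinds of features named in this abstract (n-gram similarity and dead code counts) can be illustrated with a minimal sketch. It is not the authors' implementation: the regex tokenizer, the choice of Jaccard similarity over token trigrams, and the function names `token_ngrams`, `ngram_similarity`, and `count_unused_names` are all assumptions made for illustration, and the dead code counter only handles Python source.

```python
import ast
import re


def token_ngrams(source: str, n: int = 3) -> set:
    """Return the set of n-grams over a crude token stream of the source."""
    # Assumed tokenizer: identifiers, integers, and single punctuation marks.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets (1.0 means identical sets)."""
    grams_a, grams_b = token_ngrams(a, n), token_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def count_unused_names(source: str) -> int:
    """Count names that are assigned or defined but never read."""
    stored, loaded = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Assignment targets go to `stored`; every other use counts as a read.
            (stored if isinstance(node.ctx, ast.Store) else loaded).add(node.id)
        elif isinstance(node, ast.FunctionDef):
            stored.add(node.name)
    return len(stored - loaded)


original = "def add(a, b):\n    total = a + b\n    return total\n"
copied = "def add(a, b):\n    unused = 0\n    total = a + b\n    return total\n"

# One feature vector per pair of submissions; a classifier would be trained on these.
features = [
    ngram_similarity(original, copied),
    count_unused_names(original),
    count_unused_names(copied),
]
```

Inserting a dead assignment lowers the n-gram similarity only slightly but raises the unused-name count, which is one way to see why the two kinds of features complement each other.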

Arabic English Cross-Lingual Plagiarism Detection Based on Keyphrases Extraction, Monolingual and Machine Learning Approach

Asian Journal of Research in Computer Science ◽

10.9734/ajrcos/2018/v2i330075 ◽

2019 ◽

pp. 1-12

Author(s):

Mokhtar Al-Suhaiqi ◽

Muneer A. S. Hazaa ◽

Mohammed Albared

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Detection Methods ◽

Support Vector ◽

Svm Classifier ◽

Learning Approach ◽

Plagiarism Detection ◽

Machine Learning Approach ◽

Cross Lingual ◽

Cross Language

Due to the rapid growth of research articles in various languages, the cross-lingual plagiarism detection problem has received increasing interest in recent years. Cross-lingual plagiarism detection is a more challenging task than monolingual plagiarism detection. This paper addresses the problem of cross-lingual plagiarism detection (CLPD) by proposing a method that combines keyphrase extraction, monolingual detection methods, and a machine learning approach. The research methodology used in this study made it possible to accomplish the objectives of designing, developing, and implementing efficient Arabic-English cross-lingual plagiarism detection. This paper empirically evaluates five different monolingual plagiarism detection methods, namely i) N-Grams Similarity, ii) Longest Common Subsequence, iii) Dice Coefficient, iv) Fingerprint-based Jaccard Similarity, and v) Fingerprint-based Containment Similarity. In addition, three machine learning approaches, namely i) naïve Bayes, ii) Support Vector Machine, and iii) linear logistic regression classifiers, are used for Arabic-English cross-language plagiarism detection. Several experiments are conducted to evaluate the performance of the keyphrase extraction methods. In addition, several experiments are conducted to investigate the performance of the machine learning techniques and to find the best method for Arabic-English cross-language plagiarism detection. According to these experiments, the highest result was obtained using the SVM classifier, with a 92% F-measure. In addition, all classifiers achieved their highest results when most of the monolingual plagiarism detection methods were used.
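
Several of the monolingual similarity measures listed in this abstract have compact textbook definitions, sketched below. This is an illustrative sketch rather than the paper's code: plain word sets stand in for the fingerprints the paper uses, the function names are assumptions, and the texts are assumed to have been translated into one language already.

```python
def dice_coefficient(a: set, b: set) -> float:
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a: set, b: set) -> float:
    """Share of A's items that also occur in B: |A & B| / |A|."""
    return len(a & b) / len(a) if a else 0.0


def lcs_length(x: list, y: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


suspicious = "the cat sat on the mat".split()
source = "the cat lay on the mat".split()

# One similarity score per method; these scores can serve as classifier features.
scores = {
    "dice": dice_coefficient(set(suspicious), set(source)),
    "jaccard": jaccard_similarity(set(suspicious), set(source)),
    "containment": containment(set(suspicious), set(source)),
    "lcs": lcs_length(suspicious, source),
}
```

Containment is asymmetric (it measures how much of the suspicious text is covered by the source), whereas Dice and Jaccard are symmetric, which may be one reason combining several measures helps the classifiers.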

A novel machine learning approach to the detection of identity theft in social networks based on emulated attack instances and support vector machines

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.3633 ◽

2015 ◽

Vol 28 (4) ◽

pp. 1385-1395 ◽

Cited By ~ 5

Author(s):

E. Villar-Rodríguez ◽

J. Del Ser ◽

A. I. Torre-Bastida ◽

M. N. Bilbao ◽

S. Salcedo-Sanz

Keyword(s):

Machine Learning ◽

Social Networks ◽

Support Vector Machines ◽

Identity Theft ◽

Support Vector ◽

Learning Approach ◽

Vector Machines ◽

Machine Learning Approach

An Exploration of Impact Factors Influencing Students’ Reading Literacy in Singapore with Machine Learning Approaches

International Journal of English Linguistics ◽

10.5539/ijel.v9n5p52 ◽

2019 ◽

Vol 9 (5) ◽

pp. 52 ◽

Cited By ~ 1

Author(s):

Xin Dong ◽

Jie Hu

Keyword(s):

Machine Learning ◽

Contextual Factors ◽

School Level ◽

Reading Literacy ◽

Support Vector ◽

Learning Approach ◽

Accuracy Score ◽

International Student Assessment ◽

Machine Learning Approach ◽

The Impact

This study identified the contextual factors which differentiated 15-year-old students with high- and low-achieving reading literacy in Singapore based on the Program for International Student Assessment (PISA) 2015. Data for 4,015 students from Singapore were collected from the public dataset of PISA 2015, comprising 2,646 high-achieving students and 1,369 low-achieving students in the PISA reading literacy test. The impact of the 49 contextual factors on reading literacy was analyzed at three levels: student level, family level, and school level. Support vector machine (SVM), a machine learning approach, was applied to analyze these contextual features. The results indicated that SVM could effectively distinguish these two cohorts of readers with an accuracy score of 0.78. SVM-based recursive feature elimination (SVM-RFE), another machine learning approach, was then applied to rank the selected features. The features were output in descending order of their significance to the differentiation. Finally, an optimal set of 15 contextual factors was selected by RFE-CV (cross validation), which collectively affected the differentiation of students with high and low levels of reading literacy. Based on the analysis, implications for further improving students’ reading literacy can be drawn.

Distribution Grids Fault Location employing ST based Optimized Machine Learning Approach

Energies ◽

10.3390/en11092328 ◽

2018 ◽

Vol 11 (9) ◽

pp. 2328 ◽

Cited By ~ 12

Author(s):

Md Shafiullah ◽

M. Abido ◽

Taher Abdel-Fattah

Keyword(s):

Machine Learning ◽

Fault Location ◽

Percentage Error ◽

Support Vector ◽

Learning Approach ◽

Efficiency Coefficient ◽

Learning Tools ◽

Performance Indices ◽

Machine Learning Approach ◽

Distribution Grids

Precise fault location information plays a vital role in expediting the restoration process after a power distribution grid is subjected to any kind of fault. This paper proposed a Stockwell transform (ST) based optimized machine learning approach to locate the faults and to identify the faulty sections in distribution grids. This research employed the ST to extract useful features from the recorded three-phase current signals and fed them as inputs to different machine learning tools (MLT), including multilayer perceptron neural networks (MLP-NN), support vector machines (SVM), and extreme learning machines (ELM). The proposed approach employed the constriction-factor particle swarm optimization (CF-PSO) technique to optimize the parameters of the SVM and ELM for better generalization performance. It then compared the results obtained on the test datasets in terms of selected statistical performance indices, including the root mean squared error (RMSE), mean absolute percentage error (MAPE), percent bias (PBIAS), RMSE-observations to standard deviation ratio (RSR), coefficient of determination (R²), Willmott’s index of agreement (WIA), and Nash–Sutcliffe model efficiency coefficient (NSEC), to confirm the effectiveness of the developed fault location scheme. The satisfactory values of the statistical performance indices indicated the superiority of the optimized machine learning tools over the non-optimized tools in locating faults. In addition, this research confirmed the efficacy of the faulty section identification scheme based on overall accuracy. Furthermore, the presented results validated the robustness of the developed approach against measurement noise and uncertainties associated with pre-fault loading condition, fault resistance, and inception angle.
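
Most of the performance indices named in this abstract follow standard formulas, which the sketch below computes for a list of observed and predicted values. It is a hedged illustration, not the paper's evaluation code: the function name `fault_location_metrics` is hypothetical, the sign convention for PBIAS (observed minus predicted) is one of several in use, R² is omitted because its definition varies between papers, and the sample distances are invented.

```python
import math


def fault_location_metrics(observed, predicted):
    """Compute common regression indices; assumes nonzero, non-constant observations."""
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [o - p for o, p in zip(observed, predicted)]
    sse = sum(e * e for e in errors)                    # sum of squared errors
    sst = sum((o - mean_obs) ** 2 for o in observed)    # spread of the observations
    wia_den = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2
        for o, p in zip(observed, predicted)
    )
    return {
        "RMSE": math.sqrt(sse / n),
        "MAPE": 100 / n * sum(abs(e / o) for e, o in zip(errors, observed)),
        "PBIAS": 100 * sum(errors) / sum(observed),
        "RSR": math.sqrt(sse / sst),
        "NSEC": 1 - sse / sst,
        "WIA": 1 - sse / wia_den,
    }


# Invented example: true and estimated fault distances in km.
true_km = [2.0, 4.0, 6.0, 8.0]
estimated_km = [1.0, 5.0, 5.0, 9.0]
metrics = fault_location_metrics(true_km, estimated_km)
```

Lower RMSE, MAPE, and RSR and values of NSEC and WIA closer to 1 indicate a better fit, so a single comparison between optimized and non-optimized models has to read the indices in both directions.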

Hybrid Machine Learning Approach for Skin Disease Detection Using Optimal Support Vector Machine

Intelligent Data Communication Technologies and Internet of Things - Lecture Notes on Data Engineering and Communications Technologies ◽

10.1007/978-3-030-34080-3_73 ◽

2019 ◽

pp. 647-658

Author(s):

K. Melbin ◽

Y. Jacob Vetha Raj

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Skin Disease ◽

Support Vector ◽

Disease Detection ◽

Learning Approach ◽

Machine Learning Approach ◽

Hybrid Machine

Driver Stress State Evaluation by Means of Thermal Imaging: A Supervised Machine Learning Approach Based on ECG Signal

Applied Sciences ◽

10.3390/app10165673 ◽

2020 ◽

Vol 10 (16) ◽

pp. 5673 ◽

Cited By ~ 2

Author(s):

Daniela Cardone ◽

David Perpetuini ◽

Chiara Filippini ◽

Edoardo Spadolini ◽

Lorenza Mancini ◽

...

Keyword(s):

Machine Learning ◽

Stress State ◽

Thermal Imaging ◽

Driving Simulator ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Approach ◽

Machine Learning Approach ◽

Thermal Features ◽

Driver Stress

Traffic accidents cause a large number of injuries, sometimes fatal, every year. Among the factors affecting a driver’s performance, an important role is played by stress, which can decrease decision-making capabilities and situational awareness. From this perspective, it would be beneficial to develop a non-invasive driver stress monitoring system able to recognize the driver’s altered state. In this study, a contactless procedure for assessing drivers’ stress state by means of thermal infrared imaging was investigated. Thermal imaging was acquired during an experiment on a driving simulator, and thermal features of stress were investigated in comparison with a gold-standard metric (i.e., the stress index, SI) extracted from contact electrocardiography (ECG). A data-driven multivariate machine learning approach based on non-linear support vector regression (SVR) was employed to estimate the SI from thermal features extracted from facial regions of interest (i.e., nose tip, nostrils, glabella). The predicted SI showed a good correlation with the real SI (r = 0.61, p = ~0). A two-level classification of the stress state (STRESS, SI ≥ 150, versus NO STRESS, SI < 150) was then performed based on the predicted SI. The ROC analysis showed a good classification performance with an AUC of 0.80, a sensitivity of 77%, and a specificity of 78%.
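
The two-level classification step described in this abstract (thresholding the predicted SI at 150 and scoring the result against the ECG-derived SI) can be sketched as follows. The threshold comes from the abstract; the function names and the sample SI values are invented for illustration, and the sketch does not reproduce the SVR model or the ROC analysis.

```python
def classify_stress(si_values, threshold=150):
    """Label each stress index as STRESS (True) when SI >= threshold."""
    return [si >= threshold for si in si_values]


def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for two lists of boolean labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t and p for t, p in pairs)              # stressed, flagged
    fn = sum(t and not p for t, p in pairs)          # stressed, missed
    tn = sum(not t and not p for t, p in pairs)      # calm, not flagged
    fp = sum(not t and p for t, p in pairs)          # calm, flagged
    return tp / (tp + fn), tn / (tn + fp)


# Invented values: SI from contact ECG versus SI predicted from thermal features.
ecg_si = [100, 160, 200, 120, 155, 90]
predicted_si = [110, 140, 180, 160, 151, 80]

sensitivity, specificity = sensitivity_specificity(
    classify_stress(ecg_si), classify_stress(predicted_si)
)
```

Sweeping the threshold applied to the predicted SI, while keeping the reference labels fixed, would trace out the ROC curve that the AUC summarizes.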

A machine-learning approach for structural damage detection using least square support vector machine based on a new combinational kernel function

Structural Health Monitoring ◽

10.1177/1475921716639587 ◽

2016 ◽

Vol 15 (3) ◽

pp. 302-316 ◽

Cited By ~ 42

Author(s):

Ramin Ghiasi ◽

Peyman Torkzadeh ◽

Mohammad Noori

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Damage Detection ◽

Kernel Function ◽

Structural Damage ◽

Least Square ◽

Support Vector ◽

Learning Approach ◽

Structural Damage Detection ◽

Machine Learning Approach

Validation of miRNAs as Breast Cancer Biomarkers with a Machine Learning Approach

Cancers ◽

10.3390/cancers11030431 ◽

2019 ◽

Vol 11 (3) ◽

pp. 431 ◽

Cited By ~ 11

Author(s):

Oneeb Rehman ◽

Hanqi Zhuang ◽

Ali Muhamed Ali ◽

Ali Ibrahim ◽

Zhongwei Li

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Information Gain ◽

Support Vector ◽

Learning Approach ◽

Breast Cancers ◽

Functional Studies ◽

Normal Tissues ◽

Machine Learning Approach ◽

Chi Squared

Certain small noncoding microRNAs (miRNAs) are differentially expressed in normal tissues and cancers, which makes them strong candidates as biomarkers for cancer. Previously, a selected subset of miRNAs has been experimentally verified to be linked to breast cancer. In this paper, we validated the importance of these miRNAs using a machine learning approach on miRNA expression data. We performed feature selection, using Information Gain (IG), Chi-Squared (CHI2), and Least Absolute Shrinkage and Selection Operator (LASSO), on the set of these relevant miRNAs to rank them by importance. We then performed cancer classification with these miRNAs as features, using Random Forest (RF) and Support Vector Machine (SVM) classifiers. Our results demonstrated that the miRNAs ranked higher by our analysis yielded higher classifier performance. Performance decreased as the rank of the miRNA decreased, confirming that these miRNAs had different degrees of importance as biomarkers. Furthermore, we discovered that using a minimum of three miRNAs as biomarkers for breast cancers can be as effective as using the entire set of 1800 miRNAs. This work suggests that machine learning is a useful tool for functional studies of miRNAs for cancer detection and diagnosis.
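
Of the feature selection criteria named in this abstract, Information Gain is simple enough to show in a few lines. The sketch below is only an illustration under stated assumptions: expression levels are assumed to have been discretized into "high" and "low" beforehand, the feature names `mirna_a` and `mirna_b` are placeholders, and the data are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Invented, pre-discretized expression levels for four samples (1 = tumor, 0 = normal).
labels = [1, 1, 0, 0]
expression = {
    "mirna_a": ["high", "high", "low", "low"],   # separates the classes perfectly
    "mirna_b": ["high", "low", "high", "low"],   # carries no class information
}

# Rank features from most to least informative.
ranking = sorted(
    expression,
    key=lambda name: information_gain(expression[name], labels),
    reverse=True,
)
```

A classifier can then be trained on the top-ranked features only, which is the kind of experiment behind the abstract's comparison of a few miRNAs against the full set.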

Machine Learning Approach to Raman Spectrum Analysis of MIA PaCa-2 Pancreatic Cancer Tumor Repopulating Cells for Classification and Feature Analysis

Life ◽

10.3390/life10090181 ◽

2020 ◽

Vol 10 (9) ◽

pp. 181

Author(s):

Christopher T. Mandrell ◽

Torrey E. Holland ◽

James F. Wheeler ◽

Sakineh M. A. Esmaeili ◽

Kshitij Amar ◽

...

Keyword(s):

Machine Learning ◽

Pancreatic Cancer ◽

Raman Spectra ◽

Pancreatic Cancer Cell Line ◽

Cancer Cell Line ◽

Human Pancreatic Cancer ◽

Support Vector ◽

Learning Approach ◽

K Nearest Neighbor ◽

Machine Learning Approach

A machine learning approach is applied to Raman spectra of cells from the MIA PaCa-2 human pancreatic cancer cell line to distinguish between tumor repopulating cells (TRCs) and parental control cells, and to aid in the identification of molecular signatures. Fifty-one Raman spectra from the two types of cells are analyzed to determine the best combination of data type, dimension size, and classification technique to differentiate the cell types. An accuracy of 0.98 is obtained from support vector machine (SVM) and k-nearest neighbor (kNN) classifiers with various dimension reduction and feature selection tools. We also identify some possible biomolecules that cause the spectral peaks that led to the best results.

Predicting pulmonary function from the analysis of voice: a machine learning approach

10.1101/2021.05.11.21256997 ◽

2021 ◽

Author(s):

Md. Zahangir Alam ◽

Albino Simonetti ◽

Rafaelle Billantino ◽

Nick Tayler ◽

Chris Grainge ◽

...

Keyword(s):

Machine Learning ◽

Lung Function ◽

Predictive Models ◽

Binary Classification ◽

Support Vector ◽

Learning Approach ◽

Classification Models ◽

Special Equipment ◽

Self Monitoring ◽

Machine Learning Approach

In providing proper and timely treatment of asthma, self-monitoring can play a vital role in disease control. Existing methods (such as the peak flow meter and smart spirometer) require special equipment and are not always used by the patient. Voice recordings can be used as surrogate measures of lung function to assess asthma, which has good potential for self-monitoring and could be integrated into telehealth platforms. This study aims to apply a machine learning approach to predict lung function from recorded voice for asthma patients. A threshold-based mechanism was designed to separate speech and breathing from recordings (323 recordings from 26 participants), and features extracted from these were combined with biological attributes and lung function (percentage predicted forced expiratory volume in 1 second, FEV1%). Three predictive models were developed: (a) regression models to predict lung function, (b) multi-class classification models to predict the severity, and (c) binary classification models to predict abnormality. Random Forest (RF), Support Vector Machine (SVM), and Linear Regression (LR) algorithms were implemented to develop these predictive models. Training and test samples were separated (70%:30% using balanced portioning). Features were normalised, and 10-fold cross-validation was used to measure the models' training performance on the training samples. Models were then run on the test samples to measure the final performance. The RF based regression model performed better, with the lowest root mean square error = 10.86 and mean absolute score = 11.47, as compared to other models. In predicting the severity of lung function, the SVM based model performed better, with 73.20% accuracy. The RF based model performed better among the binary classification models for predicting abnormality of lung function (accuracy = 0.85, F1-score = 0.84, and area under the receiver operating characteristic curve = 0.88).
The proposed machine learning approach can predict lung function (in terms of FEV1%), from the recorded voice files, better than other published approaches. These models can be extended to predict both the severity and abnormality of lung function with reasonable accuracies. This technique could be used to develop future telehealth solutions including smartphone-based applications which have potential to aid decision making and self-monitoring in asthma.
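
The evaluation protocol described in this abstract (a 70%:30% split followed by 10-fold cross-validation on the training portion) can be sketched with index lists alone. This is a rough illustration, not the study's procedure: it uses a plain seeded shuffle rather than the balanced partitioning the abstract mentions, and the function names and seed are assumptions.

```python
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 323 recordings, as reported in the abstract.
train_idx, test_idx = train_test_split_indices(323)

# Each training sample is used for validation exactly once across the 10 folds.
fold_sizes = [len(validation) for _, validation in k_fold(train_idx, k=10)]
```

Because several recordings come from the same participant, a split by participant rather than by recording may be worth considering, since it would keep one person's voice out of both training and test sets.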
