Machine-Learning-Based External Plagiarism Detecting Methodology From Monolingual Documents

Author(s):  
Saugata Bose ◽  
Ritambhra Korpal

This chapter proposes a framework that combines natural language processing (NLP) techniques with supervised machine learning algorithms to detect external plagiarism. The main emphasis is on constructing a framework that detects plagiarism in monolingual texts using an n-gram frequency comparison approach. The framework is based on 120 characteristics extracted during pre-processing with simple NLP techniques. Filter metrics are then applied to select the most relevant features, and a supervised classification algorithm assigns each document to one of four levels of plagiarism. A confusion matrix is built to estimate the false positives and false negatives. Finally, the authors show that a C4.5 decision tree-based classifier is better suited than naive Bayes in terms of accuracy. The framework achieved 89% accuracy with low false positive and false negative rates, and it shows higher precision and recall than the passage similarity, sentence similarity, and search space reduction methods.
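To make the idea of n-gram frequency comparison concrete, the sketch below computes a simple containment score between a suspicious text and a source text. It is an illustration only and not the chapter's framework: the whitespace tokenizer, the trigram size, and the function names `word_ngrams` and `containment` are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Return a Counter of lowercase word n-grams in text (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams, counted with multiplicity, that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    src = word_ngrams(source, n)
    total = sum(susp.values())
    if total == 0:
        # Text shorter than n words: no n-grams to compare.
        return 0.0
    shared = sum(min(count, src[gram]) for gram, count in susp.items())
    return shared / total


# Hypothetical example texts.
source = "the quick brown fox jumps over the lazy dog"
suspicious = "a quick brown fox jumps over a sleeping cat"
score = containment(suspicious, source, n=3)
print(f"trigram containment: {score:.3f}")
```

A score of this kind could serve as one numeric feature among many; how the chapter's 120 characteristics are actually defined is not stated in the abstract.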


2021 ◽  
Vol 9 ◽  
Author(s):  
Mavra Mehmood ◽  
Muhammad Rizwan ◽  
Michal Gregus ml ◽  
Sidra Abbas

Cervical cancer is the fourth most common cause of cancer death among women worldwide. Cervical cancer is associated with human papillomavirus (HPV) infection. Early screening has made cervical cancer a preventable disease, which reduces its global burden. In developing countries, women lack access to adequate screening programs because of the cost of regular examinations, limited awareness, and lack of access to medical centers. As a result, an individual patient's expected risk becomes very high. Many risk factors are relevant to cervical cancer formation. This paper proposes an approach named CervDetect that uses machine learning algorithms to evaluate the risk factors of cervical cancer formation. CervDetect uses Pearson correlation among the input variables, and between the input variables and the output variable, to pre-process the data. CervDetect uses the random forest (RF) feature selection technique to select significant features. Finally, CervDetect uses a hybrid approach that combines RF and shallow neural networks to detect cervical cancer. Results show that CervDetect accurately predicts cervical cancer, outperforms state-of-the-art studies, and achieves an accuracy of 93.6%, a mean squared error (MSE) of 0.07111, a false-positive rate (FPR) of 6.4%, and a false-negative rate (FNR) of 100%.
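As an illustration of correlation-based pre-processing, the sketch below ranks input features by the absolute value of their Pearson correlation with the output variable. It is not CervDetect's actual procedure, which the abstract does not spell out; the helper names `pearson` and `rank_features` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear signal; report 0 by convention (an assumption).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, correlation) pairs sorted by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: three input variables and a binary output variable.
features = {
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [5, 3, 4, 1, 2],
    "feature_c": [2, 2, 2, 2, 2],
}
target = [0, 0, 1, 1, 1]
ranking = rank_features(features, target)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

The same `pearson` helper could be applied to pairs of input variables to flag redundant features, which is one plausible reading of "correlation between input variables".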


2021 ◽  
Author(s):  
Prasannavenkatesan Theerthagiri ◽  
Usha Ruby A ◽  
Vidya J

Diabetes mellitus is a chronic disease that may cause many complications. Machine learning algorithms are used to diagnose and predict diabetes. Learning-based algorithms play a vital role in supporting decision making in disease diagnosis and prediction. In this paper, traditional classification algorithms and neural network-based machine learning are investigated on a diabetes dataset. Various performance measures are evaluated for the K-nearest neighbor, naive Bayes, extra trees, decision tree, radial basis function, and multilayer perceptron (MLP) algorithms. This supports estimating which patients may suffer from diabetes in the future. The results show that the multilayer perceptron gives the highest prediction accuracy with the lowest MSE of 0.19. The MLP also gives the lowest false positive and false negative rates, with the highest area under the curve of 86%.


2020 ◽  
Vol 16 (2) ◽  
pp. 87-109 ◽  
Author(s):  
Poorani Marimuthu ◽  
Varalakshmi Perumal ◽  
Vaidehi Vijayakumar

Machine learning algorithms are extensively used in healthcare analytics to learn normal and abnormal patterns automatically. The detection and prediction accuracy of any machine learning model depends on many factors, such as ground truth instances, attribute relationships, model design, the size of the dataset, the percentage of uncertainty, and the training and testing environment. Prediction models in healthcare should produce minimal false positive and false negative rates. To achieve high classification or prediction accuracy, health status screening needs to be personalized rather than following general clinical practice guidelines (CPG), which fit an average population. Hence, a personalized screening model for remote healthcare, IPAD (Intelligent Personalized Abnormality Detection), is proposed that is tailored to the specific individual. The severity level of the abnormal status is derived using personalized health values, and the IPAD model obtains an area under the curve (AUC) of 0.907.


Author(s):  
Máté E. Maros ◽  
Chang Gyu Cho ◽  
Andreas G. Junge ◽  
Benedikt Kämpgen ◽  
Victor Saase ◽  
...  

Objectives: Studies evaluating machine learning (ML) algorithms on cross-lingual RadLex® mappings for developing context-sensitive radiological reporting tools are lacking. Therefore, we investigated whether ML-based approaches can be utilized to assist radiologists in providing key imaging biomarkers, such as the Alberta Stroke Programme Early CT Score (ASPECTS). Material and Methods: A stratified random sample (age, gender, year) of CT reports (n=206) with suspected ischemic stroke was generated out of 3997 reports signed off between 2015 and 2019. Three independent, blinded readers assessed these reports and manually annotated clinico-radiologically relevant key features. The primary outcome was whether ASPECTS should have been provided (yes/no: 154/52). For all reports, both the findings and impressions underwent cross-lingual (German to English) RadLex® mappings using natural language processing. Well-established ML algorithms including classification trees, random forests, elastic net, support vector machines (SVMs) and boosted trees were evaluated in a 5 x 5-fold nested cross-validation framework. Further, a linear classifier (fastText) was directly fitted on the German reports. Ensemble learning was used to provide robust importance rankings of these ML algorithms. Performance was evaluated using derivatives of the confusion matrix and metrics of calibration including AUC, Brier score and log loss, as well as visually by calibration plots. Results: On this imbalanced classification task, SVMs showed the highest accuracies both on human-extracted features (87%) and on fully automated RadLex® features (findings: 82.5%; impressions: 85.4%). FastText without a pre-trained language model showed the highest accuracy (89.3%) and AUC (92%) on the impressions. The ensemble learner revealed that boosted trees, fastText and SVMs are the most important ML classifiers. Boosted trees fitted on the findings showed the best overall calibration curve.
Conclusions: Contextual ML-based assistance suggesting ASPECTS while reporting neuroradiological emergencies is feasible, even if ML-models are restricted to be developed on limited and highly imbalanced data sets.
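For readers unfamiliar with the calibration metrics named in this abstract, the sketch below computes the Brier score and log loss for a binary classifier using their standard definitions. The function names, the clipping constant `eps`, and the toy labels and probabilities are assumptions for illustration and are not taken from the study.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the 0/1 outcomes under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite (eps is an assumed constant).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels (1 = score should be provided) and predicted probabilities.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.5]
print(f"Brier score: {brier_score(y_true, y_prob):.3f}")
print(f"log loss:    {log_loss(y_true, y_prob):.3f}")
```

Both metrics penalize confident wrong predictions, which is why they complement accuracy on an imbalanced task like the one described.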


2014 ◽  
Vol 11 (1) ◽  
pp. 175-188 ◽  
Author(s):  
Nemanja Macek ◽  
Milan Milosavljevic

The KDD Cup '99 is a commonly used dataset for training and testing machine learning algorithms for intrusion detection systems (IDS). Some of the major downsides of the dataset are the distribution and the proportions of U2R and R2L instances, which represent the most dangerous attack types, as well as the existence of R2L attack instances identical to normal traffic. This makes the minority categories harder to detect and causes problems when building a machine learning model capable of detecting these attacks with a sufficiently low false negative rate. This paper presents a new support vector machine-based intrusion detection system that classifies unknown data instances according to both the feature values and weight factors that represent the importance of features to the classification. An increased detection rate and a significantly decreased false negative rate for the U2R and R2L categories, which have very few instances in the training set, are demonstrated empirically.


Author(s):  
Aditya Parameswaran ◽  
Dibyendu Mishra ◽  
Sanchit Bansal ◽  
Vinayak Agarwal ◽  
Anjali Goyal ◽  
...  

Background. The Office of Academic Affairs (OAA), Office of Student Life (OSL) and Information Technology Helpdesk (ITD) are support functions within a university that receive hundreds of email messages on a daily basis. A large percentage of the emails received by these departments are frequent, common queries or requests for information. Responding to every query by manually typing is a tedious and time-consuming task, and an automated approach to email response suggestion can save a lot of time. Methods. We propose an application and solution approach for automatically generating and suggesting short email responses to support queries in a university environment. Our proposed solution can be used as a one-tap or one-click solution for responding to various types of queries raised by faculty members and students in a university. We create a dataset for the application domain and make it publicly available. We apply a machine learning framework for classifying emails into categories such as office of academic affairs or information technology department. We also apply a machine learning-based classification approach for sub-category level classification. We apply text pre-processing techniques, feature selection, support vector machine (SVM) and naïve Bayes classifiers. We present an approach to overcome various natural language processing challenges in the text. Results. We conduct a series of experiments and evaluate the approach using confusion matrix and accuracy-based metrics. We study the discriminatory power of features and compare their relevance for the classification task. Our experimental results reveal that the proposed approach is effective. We conclude from our experiments that discriminatory features can be extracted from the text within our specific domain and that automatic email response suggestions can be accurately created using machine learning algorithms and frameworks.
We experiment with two different learning algorithms and observe that SVM outperforms naïve Bayes. We achieve a classification accuracy above 85% for all the classes and sub-classes. Discussion. Our experiments on email response suggestion are conducted on a corpus consisting of short and frequent emails received by a university function, but the proposed approach and techniques can be generalized to other domains as well. We observe that different classifiers give different results and that there is a significant difference in the predictive power of features.


Financial crisis has been a serious problem for organizations and individuals interested in investing in financial institutions such as banks and fund development institutions. A reliable prediction system is therefore needed for early financial crisis prediction, preventing investment in weak financial institutions that might go bankrupt. The paper focuses on designing a hybrid optimized algorithm called Hybrid Unified Machine Classifier (HUMC), based on machine learning techniques, that can identify categorical and continuous variables in a financial crisis dataset and determine the confusion matrix, which feeds a performance analysis tool reporting accuracy, F-score, sensitivity, specificity, false positive rate (FPR) and false negative rate (FNR). Early tests on the training set of the Australian credit dataset used machine learning classifiers such as decision tree, PART, naive Bayesian, RBF network and multilayer perceptron algorithms, with accuracies of 85.50%, 83.62%, 77.24%, 82.75% and 84.93% respectively. HUMC was developed by combining classification features from the decision tree, identifying hidden nodes, and modeling with a boosting technique to enhance financial crisis prediction performance. The algorithm combines the best characteristics of classification and neural networks: it finds categorization criteria in the dataset at the first level and hidden continuous data at the second stage. HUMC was implemented and tested in MATLAB. The results showed that HUMC achieved greater accuracy (86.25%) than the other classifier models, along with other performance measures.
Thus, the algorithm improves financial crisis prediction with good performance.
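The performance measures listed in this abstract all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the counts are hypothetical and do not come from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from binary confusion matrix counts.

    Assumes every denominator is non-zero, i.e. at least one actual positive,
    one actual negative, and one predicted positive.
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
        "fpr": fp / (fp + tn),  # equals 1 - specificity
        "fnr": fn / (fn + tp),  # equals 1 - sensitivity
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because FPR and FNR are the complements of specificity and sensitivity, reporting all four is redundant but can make error costs easier to read.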

