Machine-Learning-Based External Plagiarism Detecting Methodology From Monolingual Documents

2019 ◽

pp. 442-458

Author(s):

Saugata Bose ◽

Ritambhra Korpal

Keyword(s):

Machine Learning ◽

Language Processing ◽

Confusion Matrix ◽

False Negative ◽

False Negative Rate ◽

Search Space ◽

Machine Learning Algorithms ◽

C4.5 Decision Tree ◽

N Gram ◽

Four Levels

This chapter proposes combining natural language processing (NLP) techniques with supervised machine learning algorithms to detect external plagiarism. The main emphasis is on constructing a framework that detects plagiarism in monolingual texts using an n-gram frequency comparison approach. The framework is based on 120 characteristics extracted during pre-processing with simple NLP techniques. Filter metrics are then applied to select the most relevant features, and a supervised classification algorithm classifies the documents into four levels of plagiarism. A confusion matrix is built to estimate the false positives and false negatives. Finally, the authors show that a C4.5 decision-tree-based classifier is more suitable than naive Bayes in terms of accuracy. The framework achieved 89% accuracy with low false positive and false negative rates, and it shows higher precision and recall than the passage similarity, sentence similarity, and search space reduction methods.
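
The n-gram frequency comparison mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-level tokenization, the cosine similarity measure, and the names ngram_counts and cosine_similarity are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n=3):
    """Count word n-grams in lowercased, whitespace-tokenized text (assumed tokenization)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram frequency profiles (Counters)."""
    # Counter returns 0 for n-grams missing from b, so only shared n-grams contribute.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example texts, differing in a single word.
source = "the quick brown fox jumps over the lazy dog"
suspicious = "the quick brown fox leaps over the lazy dog"
score = cosine_similarity(ngram_counts(source, 2), ngram_counts(suspicious, 2))
print(round(score, 2))
```

A score near 1 indicates heavily overlapping n-gram profiles. In a pipeline like the one described, such scores would be one of many features passed to a classifier rather than a final verdict.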


Machine Learning Assisted Cervical Cancer Detection

Frontiers in Public Health ◽

10.3389/fpubh.2021.788376 ◽

2021 ◽

Vol 9 ◽

Author(s):

Mavra Mehmood ◽

Muhammad Rizwan ◽

Michal Gregus ml ◽

Sidra Abbas

Keyword(s):

Machine Learning ◽

Cervical Cancer ◽

Mean Squared Error ◽

Medical Center ◽

Pearson Correlation ◽

False Negative ◽

Hybrid Approach ◽

False Negative Rate ◽

Machine Learning Algorithms ◽

Screening Programs

Cervical cancer is the fourth most common cause of cancer death in women worldwide. It is associated with human papillomavirus (HPV) infection. Early screening has made cervical cancer a preventable disease, which helps reduce its global burden. In developing countries, women often lack access to adequate screening programs because of the cost of regular examinations, limited awareness, and lack of access to medical centers. As a result, the individual patient's risk becomes very high. There are many risk factors for cervical cancer. This paper proposes an approach named CervDetect that uses machine learning algorithms to evaluate these risk factors. CervDetect pre-processes the data using the Pearson correlation among the input variables as well as between the input variables and the output variable. It uses the random forest (RF) feature selection technique to select significant features. Finally, it takes a hybrid approach, combining RF and shallow neural networks, to detect cervical cancer. Results show that CervDetect accurately predicts cervical cancer, outperforms state-of-the-art studies, and achieves an accuracy of 93.6%, a mean squared error (MSE) of 0.07111, a false-positive rate (FPR) of 6.4%, and a false-negative rate (FNR) of 100%.
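
For reference, the accuracy, FPR, and FNR figures quoted in this abstract are all derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The function name binary_rates and the counts are invented for illustration and are not taken from the paper. They only show that accuracy and FNR can diverge sharply when the positive class is rare.

```python
def binary_rates(tp, fp, tn, fn):
    """Accuracy, false positive rate, and false negative rate from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # FPR: share of actual negatives that were flagged as positive.
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        # FNR: share of actual positives that were missed.
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Hypothetical counts: 95 actual negatives, 5 actual positives, every positive missed.
rates = binary_rates(tp=0, fp=5, tn=90, fn=5)
print(rates)
```

With these made-up counts the classifier is 90% accurate yet misses every positive case, which is why screening studies usually report FNR alongside accuracy.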


Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms

10.21203/rs.3.rs-514771/v2 ◽

2021 ◽

Author(s):

Prasannavenkatesan Theerthagiri ◽

Usha Ruby A ◽

Vidya J

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Nearest Neighbor ◽

False Positive Rate ◽

Learning Algorithms ◽

False Negative ◽

False Negative Rate ◽

Disease Diagnosis ◽

Machine Learning Algorithms ◽

K Nearest Neighbor

Diabetes mellitus is a chronic disease that may cause many complications. Machine learning algorithms are used to diagnose and predict diabetes. Learning-based algorithms play a vital role in supporting decision making in disease diagnosis and prediction. In this paper, traditional classification algorithms and neural-network-based machine learning are investigated on a diabetes dataset. Various performance measures are evaluated for the K-nearest neighbor, naive Bayes, extra trees, decision tree, radial basis function, and multilayer perceptron (MLP) algorithms. This supports estimating which patients may suffer from diabetes in the future. The results show that the multilayer perceptron gives the highest prediction accuracy with the lowest MSE of 0.19. The MLP also gives the lowest false positive rate and false negative rate, with the highest area under the curve of 86%.


Intelligent Personalized Abnormality Detection for Remote Health Monitoring

International Journal of Intelligent Information Technologies ◽

10.4018/ijiit.2020040105 ◽

2020 ◽

Vol 16 (2) ◽

pp. 87-109 ◽

Cited By ~ 1

Author(s):

Poorani Marimuthu ◽

Varalakshmi Perumal ◽

Vaidehi Vijayakumar

Keyword(s):

Machine Learning ◽

Prediction Accuracy ◽

Prediction Models ◽

False Negative ◽

False Negative Rate ◽

Area Under The Curve ◽

Ground Truth ◽

Machine Learning Algorithms ◽

Abnormality Detection ◽

Remote Healthcare

Machine learning algorithms are extensively used in healthcare analytics to learn normal and abnormal patterns automatically. The detection and prediction accuracy of any machine learning model depends on many factors, such as ground truth instances, attribute relationships, model design, the size of the dataset, the percentage of uncertainty, and the training and testing environment. Prediction models in healthcare should produce minimal false positive and false negative rates. To achieve high classification or prediction accuracy, the screening of health status needs to be personalized rather than following general clinical practice guidelines (CPG), which fit an average population. Hence, a personalized screening model for remote healthcare, IPAD (Intelligent Personalized Abnormality Detection), is proposed that is tailored to the specific individual. The severity level of the abnormal status is derived using personalized health values, and the IPAD model obtains an area under the curve (AUC) of 0.907.
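
The AUC reported here can be read as the probability that a randomly chosen abnormal case receives a higher score than a randomly chosen normal case. A minimal sketch of that pairwise definition follows. The function name roc_auc and the labels and scores are hypothetical and unrelated to the IPAD data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = abnormal) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

This quadratic pairwise loop is only meant to make the definition concrete. For large datasets a rank-based computation gives the same value more efficiently.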


Comparative Analysis of Machine Learning Algorithms for Computer-Assisted Reporting Based on Fully Automated Cross-Lingual RadLex® Mappings

10.20944/preprints202004.0354.v1 ◽

2020 ◽

Author(s):

Máté E. Maros ◽

Chang Gyu Cho ◽

Andreas G. Junge ◽

Benedikt Kämpgen ◽

Victor Saase ◽

...

Keyword(s):

Machine Learning ◽

Language Processing ◽

Confusion Matrix ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Imaging Biomarkers ◽

Brier Score ◽

Support Vector ◽

Computer Assisted ◽

Cross Lingual

Objectives: Studies evaluating machine learning (ML) algorithms on cross-lingual RadLex® mappings for developing context-sensitive radiological reporting tools are lacking. Therefore, we investigated whether ML-based approaches can be utilized to assist radiologists in providing key imaging biomarkers, such as the Alberta Stroke Programme Early CT Score (ASPECTS). Material and Methods: A stratified random sample (age, gender, year) of CT reports (n=206) with suspected ischemic stroke was generated out of 3997 reports signed off between 2015 and 2019. Three independent, blinded readers assessed these reports and manually annotated clinico-radiologically relevant key features. The primary outcome was whether ASPECTS should have been provided (yes/no: 154/52). For all reports, both the findings and impressions underwent cross-lingual (German to English) RadLex® mappings using natural language processing. Well-established ML algorithms including classification trees, random forests, elastic net, support vector machines (SVMs) and boosted trees were evaluated in a 5 x 5-fold nested cross-validation framework. Further, a linear classifier (fastText) was directly fitted on the German reports. Ensemble learning was used to provide robust importance rankings of these ML algorithms. Performance was evaluated using derivatives of the confusion matrix and metrics of calibration including AUC, Brier score and log loss, as well as visually by calibration plots. Results: On this imbalanced classification task, SVMs showed the highest accuracies both on human-extracted (87%) and fully automated RadLex® features (findings: 82.5%; impressions: 85.4%). FastText without a pre-trained language model showed the highest accuracy (89.3%) and AUC (92%) on the impressions. The ensemble learner revealed that boosted trees, fastText and SVMs are the most important ML classifiers. Boosted trees fitted on the findings showed the best overall calibration curve. Conclusions: Contextual ML-based assistance suggesting ASPECTS while reporting neuroradiological emergencies is feasible, even if ML models are restricted to being developed on limited and highly imbalanced data sets.
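
Two of the metrics named in this abstract, the Brier score and log loss, are computed from predicted probabilities rather than hard class labels. The sketch below uses their standard binary definitions. The function names, the clipping constant eps, and the example labels and probabilities are assumptions for illustration and do not come from the study.

```python
from math import log


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip probabilities away from 0 and 1 so log() stays finite (eps is an assumed constant).
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(labels)


# Hypothetical outcomes and predicted probabilities of the positive class.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
brier = brier_score(labels, probs)
ll = log_loss(labels, probs)
print(round(brier, 4), round(ll, 4))
```

Lower is better for both metrics. Log loss penalizes confident wrong predictions much more heavily than the Brier score does.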


Reducing U2R and R2L category false negative rates with support vector machines

Serbian Journal of Electrical Engineering ◽

10.2298/sjee131007015m ◽

2014 ◽

Vol 11 (1) ◽

pp. 175-188 ◽

Cited By ~ 1

Author(s):

Nemanja Macek ◽

Milan Milosavljevic

Keyword(s):

Machine Learning ◽

Detection System ◽

False Negative ◽

False Negative Rate ◽

Machine Learning Algorithms ◽

Support Vector ◽

Negative Rate ◽

Machine Learning Model ◽

Vector Machines ◽

Feature Values

The KDD Cup '99 is a commonly used dataset for training and testing machine learning algorithms for intrusion detection systems (IDS). Some of the major downsides of the dataset are the distribution and proportions of U2R and R2L instances, which represent the most dangerous attack types, as well as the existence of R2L attack instances identical to normal traffic. This makes detection of the minority categories more complex and causes problems when building a machine learning model capable of detecting these attacks with a sufficiently low false negative rate. This paper presents a new support vector machine based intrusion detection system that classifies unknown data instances according to both the feature values and weight factors that represent the importance of features to the classification. An increased detection rate and a significantly decreased false negative rate for the U2R and R2L categories, which have very few instances in the training set, have been demonstrated empirically.


Automatic email response suggestion for support departments within a university

10.7287/peerj.preprints.26531v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Aditya Parameswaran ◽

Dibyendu Mishra ◽

Sanchit Bansal ◽

Vinayak Agarwal ◽

Anjali Goyal ◽

...

Keyword(s):

Machine Learning ◽

Information Technology ◽

Language Processing ◽

Confusion Matrix ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Academic Affairs ◽

Support Vector ◽

Support Functions ◽

Significant Difference

Background. The Office of Academic Affairs (OAA), Office of Student Life (OSL), and Information Technology Helpdesk (ITD) are support functions within a university that receive hundreds of email messages daily. A large percentage of the emails received by these departments are frequent, common queries or requests for information. Responding to every query by typing manually is tedious and time consuming, and an automated approach to email response suggestion can save a lot of time. Methods. We propose an application and solution approach for automatically generating and suggesting short email responses to support queries in a university environment. Our proposed solution can be used as a one-tap or one-click solution for responding to various types of queries raised by faculty members and students. We create a dataset for the application domain and make it publicly available. We apply a machine learning framework to classify emails into categories such as office of academic affairs or information technology department, and a machine-learning-based classification approach for sub-category classification as well. We apply text pre-processing techniques, feature selection, and support vector machine (SVM) and naïve Bayes classifiers. We present an approach to overcome various natural language processing challenges in the text. Results. We conduct a series of experiments and evaluate the approach using a confusion matrix and accuracy-based metrics. We study the discriminatory power of features and compare their relevance for the classification task. Our experimental results reveal that the proposed approach is effective. We conclude from our experiments that discriminatory features can be extracted from the text within our specific domain and that automatic email response suggestions can be created accurately using machine learning algorithms. We experiment with two different learning algorithms and observe that SVM outperforms naïve Bayes. We achieve a classification accuracy above 85% for all the classes and sub-classes. Discussion. Our experiments on email response suggestion are conducted on a corpus consisting of short and frequent emails handled by a university function, but the proposed approach and techniques can be generalized to other domains. We observe that different classifiers give different results and that there is a significant difference in the predictive power of features.


Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms

10.21203/rs.3.rs-514771/v1 ◽

2021 ◽

Author(s):

Prasannavenkatesan Theerthagiri ◽

Usha Ruby A ◽

Vidya J

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Nearest Neighbor ◽

False Positive Rate ◽

Learning Algorithms ◽

False Negative ◽

False Negative Rate ◽

Disease Diagnosis ◽

Machine Learning Algorithms ◽

K Nearest Neighbor

Diabetes mellitus is a chronic disease that may cause many complications. Machine learning algorithms are used to diagnose and predict diabetes. Learning-based algorithms play a vital role in supporting decision making in disease diagnosis and prediction. In this paper, traditional classification algorithms and neural-network-based machine learning are investigated on a diabetes dataset. Various performance measures are evaluated for the K-nearest neighbor, naive Bayes, extra trees, decision tree, radial basis function, and multilayer perceptron (MLP) algorithms. This supports estimating which patients may suffer from diabetes in the future. The results show that the multilayer perceptron gives the highest prediction accuracy with the lowest MSE of 0.19. The MLP also gives the lowest false positive rate and false negative rate, with the highest area under the curve of 86%.


Enhancement in Performance of Financial Crisis Prediction using Hybridization of Machine Learning Classifiers

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c8722.019320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 1279-1285

Keyword(s):

Machine Learning ◽

Financial Crisis ◽

Decision Tree ◽

Financial Institutions ◽

Confusion Matrix ◽

False Negative ◽

False Negative Rate ◽

Analysis Tool ◽

Machine Learning Classifiers ◽

Learning Classifiers

Financial crisis is a serious problem faced by organizations and individuals interested in investing in financial institutions such as banks and fund development institutions. A reliable system for early financial crisis prediction is therefore needed to prevent investment in weak financial institutions that might go bankrupt. The paper focuses on designing a hybrid optimized algorithm called the Hybrid Unified Machine Classifier (HUMC), based on machine learning techniques, that is capable of identifying categorical and continuous variables in a financial crisis dataset and determining the confusion matrix, which feeds a performance analysis tool reporting accuracy, F-score, sensitivity, specificity, false positive rate (FPR), and false negative rate (FNR). In early testing on the training set of the Australian credit dataset, the Decision Tree, PART, Naive Bayesian, RBF Network, and Multilayer Perceptron classifiers achieved accuracies of 85.50%, 83.62%, 77.24%, 82.75%, and 84.93%, respectively. HUMC was developed by combining classification features from the decision tree, identifying hidden nodes, and modeling with a boosting technique to enhance the performance of financial crisis prediction. The design combines the best characteristics of classification and neural networks: it finds categorization criteria in the dataset at the first level and hidden continuous data at the second stage. HUMC was implemented and tested in MATLAB. The results show that HUMC achieved greater accuracy (86.25%) than the other classifier models, along with other performance measures. Thus, the algorithm enhances financial crisis prediction with good performance.


Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms

10.21203/rs.3.rs-514771/v3 ◽

2021 ◽

Author(s):

Prasannavenkatesan Theerthagiri ◽

Usha Ruby A ◽

Vidya J

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Nearest Neighbor ◽

False Positive Rate ◽

Learning Algorithms ◽

False Negative ◽

False Negative Rate ◽

Disease Diagnosis ◽

Machine Learning Algorithms ◽

K Nearest Neighbor

Diabetes mellitus is a chronic disease that may cause many complications. Machine learning algorithms are used to diagnose and predict diabetes. Learning-based algorithms play a vital role in supporting decision making in disease diagnosis and prediction. In this paper, traditional classification algorithms and neural-network-based machine learning are investigated on a diabetes dataset. Various performance measures are evaluated for the K-nearest neighbor, naive Bayes, extra trees, decision tree, radial basis function, and multilayer perceptron (MLP) algorithms. This supports estimating which patients may suffer from diabetes in the future. The results show that the multilayer perceptron gives the highest prediction accuracy with the lowest MSE of 0.19. The MLP also gives the lowest false positive rate and false negative rate, with the highest area under the curve of 86%.
