Identification of Malignancies from Free-Text Histopathology Reports Using a Multi-Model Supervised Machine Learning Approach

Information ◽  
2020 ◽  
Vol 11 (9) ◽  
pp. 455
Author(s):  
Victor Olago ◽  
Mazvita Muchengeti ◽  
Elvira Singh ◽  
Wenlong C. Chen

We explored various Machine Learning (ML) models to evaluate how each model performs in the task of classifying histopathology reports. We trained, optimized, and performed classification with Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Adaptive Boosting (AB), Decision Trees (DT), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and a Dummy classifier. We started with 60,083 histopathology reports, which reduced to 60,069 after pre-processing. The F1-scores for SVM, SGD, KNN, RF, DT, LR, AB, and GNB were 97%, 96%, 96%, 96%, 92%, 96%, 84%, and 88%, respectively, while the misclassification rates were 3.31%, 5.25%, 4.39%, 1.75%, 3.5%, 4.26%, 23.9%, and 19.94%, respectively. The approximate run times were 2 h, 20 min, 40 min, 8 h, 40 min, 10 min, 50 min, and 4 min, respectively. RF had the longest run time but the lowest misclassification rate on the labeled data. Our study demonstrated the possibility of applying ML techniques to the processing of free-text pathology reports for cancer registries, supporting cancer incidence reporting in a Sub-Saharan Africa setting. This is an important consideration for resource-constrained environments looking to leverage ML techniques to reduce workloads and improve the timeliness of reporting of cancer statistics.
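
A minimal sketch of how such a multi-model comparison could be set up with scikit-learn, assuming TF-IDF features over the report text; the file name, column names, and hyperparameters are hypothetical and not taken from the study.

# Sketch: compare several scikit-learn classifiers on free-text reports.
# Assumes a CSV with 'report_text' and 'label' columns (hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

df = pd.read_csv("histopathology_reports.csv")          # hypothetical file
X_train, X_test, y_train, y_test = train_test_split(
    df["report_text"], df["label"], test_size=0.2, random_state=42)

models = {
    "SVM": LinearSVC(),
    "SGD": SGDClassifier(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(n_estimators=200),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "AB": AdaBoostClassifier(),
}

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    print(name,
          "F1: %.3f" % f1_score(y_test, pred, average="weighted"),
          "misclassification: %.2f%%" % (100 * (1 - accuracy_score(y_test, pred))))

A weighted F1 is used here because pathology labels are typically imbalanced; the misclassification rate is simply one minus accuracy.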

Author(s):  
P Sai Teja

Unsolicited e-mail, also known as spam, has become a huge concern for every e-mail user. In recent times it has become very difficult to filter spam emails, as these emails are crafted in ways specifically designed to evade anti-spam filters. This paper reviews and compares the performance metrics of several supervised machine learning techniques, namely SVM (Support Vector Machine), Random Forest, Decision Tree, CNN (Convolutional Neural Network), KNN (K-Nearest Neighbor), MLP (Multi-Layer Perceptron), AdaBoost (Adaptive Boosting), and the Naïve Bayes algorithm, for classifying emails as spam. The objective of this study is to consider the content of the emails, learn from a finite available dataset, and develop a classification model that can predict whether an e-mail is spam or not.
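
As a rough illustration of the kind of comparison described, the sketch below fits two of the listed classifiers on bag-of-words features from a hypothetical emails.csv with 'text' and 'label' columns; it is not the paper's setup.

# Sketch: spam vs. ham classification from email text (hypothetical dataset).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

emails = pd.read_csv("emails.csv")                       # hypothetical: columns 'text', 'label'
X_train, X_test, y_train, y_test = train_test_split(
    emails["text"], emails["label"], test_size=0.25, random_state=0)

vec = CountVectorizer(stop_words="english")
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    clf.fit(Xtr, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(Xte)))   # precision, recall, F1 per class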


Water ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 147
Author(s):  
Mohammadreza Moeini ◽  
Ali Shojaeizadeh ◽  
Mengistu Geza

Machine Learning (ML) algorithms provide an alternative for the prediction of pollutant concentration. We compared eight ML algorithms (Linear Regression (LR), uniform weighting k-Nearest Neighbor (UW-kNN), variable weighting k-Nearest Neighbor (VW-kNN), Support Vector Regression (SVR), Artificial Neural Network (ANN), Regression Tree (RT), Random Forest (RF), and Adaptive Boosting (AdB)) to evaluate the feasibility of ML approaches for estimating Total Suspended Solids (TSS) using the national stormwater quality database. Six factors were used as features to train the algorithms, with TSS concentration as the target parameter: drainage area, land use, percent imperviousness, rainfall depth, runoff volume, and antecedent dry days. Comparisons among the ML methods demonstrated a high degree of variability in model performance, with coefficient of determination (R2) and Nash–Sutcliffe efficiency (NSE) values ranging from 0.15 to 0.77. The Root Mean Square Error (RMSE) values ranged from 110 mg/L to 220 mg/L. The best fit was obtained using the AdB and RF models, with R2 values of 0.77 and 0.74 in the training step and 0.67 and 0.64 in the prediction step. The NSE values were 0.76 and 0.72 in the training step and 0.67 and 0.62 in the prediction step. The predictions from AdB were sensitive to all six factors; however, the sensitivity level was variable.
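
Since the Nash–Sutcliffe efficiency is not built into scikit-learn, a short sketch of the evaluation loop may help; the feature column names and file are placeholders, and categorical inputs such as land use would need numeric encoding in practice.

# Sketch: regression of TSS concentration from six stormwater features, with
# R^2, RMSE, and Nash-Sutcliffe efficiency (NSE). Column names are illustrative,
# not the exact fields of the national stormwater quality database.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import r2_score, mean_squared_error

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

data = pd.read_csv("stormwater.csv")                     # hypothetical file, numeric features assumed
features = ["drainage_area", "land_use", "imperviousness",
            "rainfall_depth", "runoff_volume", "antecedent_dry_days"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["tss_mg_per_l"], test_size=0.3, random_state=1)

for name, model in [("RF", RandomForestRegressor(n_estimators=300)),
                    ("AdB", AdaBoostRegressor(n_estimators=300))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: R2={r2_score(y_test, pred):.2f}  NSE={nse(y_test, pred):.2f}  RMSE={rmse:.0f} mg/L")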


Author(s):  
Dimple Chehal ◽  
Parul Gupta ◽  
Payal Gulati

Sentiment analysis of product reviews on e-commerce platforms aids in determining the preferences of customers. Aspect-based sentiment analysis (ABSA) assists in identifying the contributing aspects and their corresponding polarity, thereby allowing for a more detailed analysis of the customer's inclination toward product aspects. This analysis helps in the transition from the traditional rating-based recommendation process to an improved aspect-based process. To automate ABSA, a labelled dataset is required to train a supervised machine learning model. As the availability of such datasets is limited due to the human effort involved, an annotated dataset is provided here for performing ABSA on customer reviews of mobile phones. The dataset, comprising product reviews of the Apple iPhone 11, has been manually annotated with predefined aspect categories and aspect sentiments. The dataset's accuracy has been validated using state-of-the-art machine learning techniques such as Naïve Bayes, Support Vector Machine, Logistic Regression, Random Forest, K-Nearest Neighbor and Multi-Layer Perceptron (MLP), a sequential model built with the Keras API. The MLP model built through the Keras Sequential API for classifying review text into aspect categories produced the most accurate result, with 67.45 percent accuracy. K-Nearest Neighbor performed the worst, with only 49.92 percent accuracy. The Support Vector Machine had the highest accuracy for classifying review text into aspect sentiments, at 79.46 percent, while the Keras model had the lowest, at 76.30 percent. The contribution is beneficial as a benchmark dataset for ABSA of mobile phone reviews.
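
A minimal sketch of a Keras Sequential MLP for aspect-category classification, in the spirit of the model described; the placeholder reviews, TF-IDF features, layer sizes, and training settings are assumptions rather than the authors' configuration.

# Sketch: Keras Sequential MLP mapping review text to aspect categories.
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

reviews = ["Battery drains too quickly", "Amazing camera quality",
           "Battery lasts all day", "Camera struggles in low light"]   # placeholder texts
aspects = ["battery", "camera", "battery", "camera"]                   # placeholder aspect labels

vec = TfidfVectorizer(max_features=5000)
X = vec.fit_transform(reviews).toarray()
y = LabelEncoder().fit_transform(aspects)
num_classes = len(set(aspects))

model = keras.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),   # one output per aspect category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=2)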


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Binjie Chen ◽  
Fushan Wei ◽  
Chunxiang Gu

Since its inception, Bitcoin has been subject to numerous thefts due to its enormous economic value. Hackers steal Bitcoin wallet keys to transfer Bitcoin from compromised users, causing huge economic losses to victims. To address the security threat of Bitcoin theft, supervised learning methods were used in this study to detect and provide warnings about Bitcoin theft events. To overcome the shortcomings of the existing work, more comprehensive features of Bitcoin transaction data were extracted, the unbalanced dataset was equalized, and five supervised methods—the k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), adaptive boosting (AdaBoost), and multi-layer perceptron (MLP) techniques—as well as three unsupervised methods—the local outlier factor (LOF), one-class support vector machine (OCSVM), and Mahalanobis distance-based approach (MDB)—were used for detection. The best performer among these algorithms was the RF algorithm, which achieved recall, precision, and F1 values of 95.9%. The experimental results showed that the designed features are more effective than the currently used ones. The results of the supervised methods were significantly better than those of the unsupervised methods, and the results of the supervised methods could be further improved after equalizing the training set.
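
One common way to equalize an unbalanced dataset is random undersampling of the majority class before training; the sketch below illustrates this with a random forest on a hypothetical transaction table (the paper's exact balancing method and feature set are not reproduced here).

# Sketch: balance the classes by undersampling, then train a random forest.
# Assumes a CSV of numeric transaction features with a 'label' column (1 = theft-related).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

tx = pd.read_csv("bitcoin_transactions.csv")             # hypothetical file
theft = tx[tx["label"] == 1]
normal = tx[tx["label"] == 0].sample(n=len(theft), random_state=0)     # undersample majority class
balanced = pd.concat([theft, normal]).sample(frac=1, random_state=0)   # shuffle rows

X = balanced.drop(columns=["label"])
y = balanced["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print("precision %.3f  recall %.3f  F1 %.3f" % (
    precision_score(y_test, pred), recall_score(y_test, pred), f1_score(y_test, pred)))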


2017 ◽  
Author(s):  
Chin Lin ◽  
Chia-Jung Hsu ◽  
Yu-Sheng Lou ◽  
Shih-Jen Yeh ◽  
Chia-Cheng Lee ◽  
...  

BACKGROUND Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN). OBJECTIVE Our objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes in discharge notes. METHODS We used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. We conducted the evaluation using 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017 in the Tri-Service General Hospital in Taipei, Taiwan. We used the receiver operating characteristic curve as an evaluation measure, and calculated the area under the curve (AUC) and F-measure as the global measure of effectiveness. RESULTS In 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes. CONCLUSIONS Word embedding combined with a CNN showed outstanding performance compared with traditional methods, needing very little data preprocessing. This shows that future studies will not be limited by incomplete dictionaries. A large amount of unstructured information from free-text medical writing will be extracted by automated approaches in the future, and we believe that the health care field is about to enter the age of big data.
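
A minimal sketch of a word-embedding plus CNN text classifier of the general shape described (embedding layer, one convolutional layer, global max pooling, sigmoid outputs for chapter-level codes); the vocabulary size, sequence length, and layer sizes are assumptions, not the authors' architecture.

# Sketch: embedding + 1D CNN for multi-label chapter-level ICD-10-CM prediction.
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim, n_chapters = 30000, 500, 200, 22        # assumed dimensions

model = keras.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),       # could be initialized from a pretrained embedding
    layers.Conv1D(filters=256, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(n_chapters, activation="sigmoid"),  # one output per chapter-level code
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[keras.metrics.AUC()])
model.summary()

Sigmoid outputs with binary cross-entropy allow a single discharge note to be assigned several chapter-level codes at once, which matches the multi-label nature of the task.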


2018 ◽  
Vol 1 (1) ◽  
pp. 224-234 ◽  
Author(s):  
Donia Gamal ◽  
Marco Alfonse ◽  
El-Sayed M. El-Horbaty ◽  
Abdel-Badeeh M. Salem

Sentiment classification (SC) refers to the task of sentiment analysis (SA), a subfield of natural language processing (NLP), used to decide whether textual content implies a positive or negative review. This research focuses on the various machine learning (ML) algorithms which are utilized in the analysis of sentiments and in the mining of reviews in different datasets. Overall, an SC task consists of two phases. The first phase deals with feature extraction (FE); three different FE algorithms are applied in this research. The second phase covers the classification of the reviews by using various ML algorithms. These are Naïve Bayes (NB), Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Passive Aggressive (PA), Maximum Entropy (ME), Adaptive Boosting (AdaBoost), Multinomial NB (MNB), Bernoulli NB (BNB), Ridge Regression (RR) and Logistic Regression (LR). The performance of PA with unigram features is the best among the algorithms for all the datasets used (IMDB, Cornell Movies, Amazon and Twitter), providing values that range from 87% to 99.96% across all evaluation metrics.
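
The best-reported configuration, Passive Aggressive with unigram features, can be expressed as a short scikit-learn pipeline; the toy texts below are placeholders, not the IMDB, Cornell Movies, Amazon, or Twitter data.

# Sketch: Passive Aggressive classifier over unigram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

train_texts = ["great movie, loved it", "terrible plot and acting"]    # placeholder reviews
train_labels = ["pos", "neg"]
test_texts = ["loved the acting", "what a terrible movie"]
test_labels = ["pos", "neg"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)),   # unigrams only
                    PassiveAggressiveClassifier(max_iter=1000))
clf.fit(train_texts, train_labels)
print(classification_report(test_labels, clf.predict(test_texts)))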


2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Yihan Zhang ◽  
Dong Yang ◽  
Zifeng Liu ◽  
Chaojin Chen ◽  
Mian Ge ◽  
...  

Abstract Background Early prediction of acute kidney injury (AKI) after liver transplantation (LT) facilitates timely recognition and intervention. We aimed to build a risk predictor of post-LT AKI via supervised machine learning and to visualize the mechanism driving it, to assist clinical decision-making. Methods Data of 894 cases that underwent liver transplantation from January 2015 to September 2019 were collected, covering demographics, donor characteristics, etiology, peri-operative laboratory results, co-morbidities and medications. The primary outcome was new-onset AKI after LT according to Kidney Disease Improving Global Outcomes guidelines. The predictive performance of five classifiers, including logistic regression, support vector machine, random forest, gradient boosting machine (GBM) and adaptive boosting, was evaluated by the area under the receiver-operating characteristic curve (AUC), accuracy, F1-score, sensitivity and specificity. The model with the best performance was validated on an independent dataset of 195 adult LT cases from October 2019 to March 2021. The SHapley Additive exPlanations (SHAP) method was applied to evaluate feature importance and explain the predictions made by the ML algorithms. Results 430 AKI cases (55.1%) were diagnosed out of 780 included cases. The GBM model achieved the highest AUC (0.76, CI 0.70 to 0.82), F1-score (0.73, CI 0.66 to 0.79) and sensitivity (0.74, CI 0.66 to 0.80) in the internal validation set, and a comparable AUC (0.75, CI 0.67 to 0.81) in the external validation set. High preoperative indirect bilirubin, low intraoperative urine output, long anesthesia time, low preoperative platelets, and graft steatosis graded NASH CRN 1 and above were revealed by the SHAP method as the top 5 important variables contributing to the diagnosis of post-LT AKI made by the GBM model. Conclusions Our GBM-based predictor of post-LT AKI provides a highly interoperable tool across institutions to assist decision-making after LT.
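
A minimal sketch of the GBM-plus-SHAP workflow, assuming a hypothetical peri-operative table with numeric features and an 'aki' outcome column; it illustrates TreeExplainer-based attribution rather than the authors' exact pipeline.

# Sketch: gradient boosting classifier with SHAP feature attribution.
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

data = pd.read_csv("lt_aki_cohort.csv")                  # hypothetical peri-operative dataset, numeric features
X = data.drop(columns=["aki"])
y = data["aki"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)
print("AUC: %.2f" % roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]))

explainer = shap.TreeExplainer(gbm)                      # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)                   # ranks features by mean |SHAP value|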


2021 ◽  
Author(s):  
Yihan Zhang ◽  
Dong Yang ◽  
Zifeng Liu ◽  
Chaojin Chen ◽  
Mian Ge ◽  
...  

Abstract Background: Early prediction of acute kidney injury (AKI) after liver transplantation (LT) facilitates timely recognition and intervention. We aimed to build a risk predictor of post-LT AKI via supervised machine learning and to visualize the mechanism driving it, to assist clinical decision-making. Methods: Data of 894 cases that underwent liver transplantation from January 2015 to September 2019 were collected, covering demographics, donor characteristics, etiology, peri-operative laboratory results, co-morbidities and medications. The primary outcome was new-onset AKI after LT according to Kidney Disease Improving Global Outcomes guidelines. The predictive performance of five classifiers, including logistic regression, support vector machine, random forest, gradient boosting machine (GBM) and adaptive boosting, was evaluated by the area under the receiver-operating characteristic curve (AUC), accuracy, F1-score, sensitivity and specificity. The SHapley Additive exPlanations (SHAP) method was applied to evaluate feature importance and explain the predictions made by the ML algorithms. Results: 430 AKI cases (55.1%) were diagnosed out of 780 included cases. The GBM model achieved the highest AUC (0.76, CI 0.70 to 0.82), F1-score (0.73, CI 0.66 to 0.79) and sensitivity (0.74, CI 0.66 to 0.80). High preoperative indirect bilirubin, low intraoperative urine output, long anesthesia time, low preoperative platelets, and graft steatosis graded NASH CRN 1 and above were revealed by the SHAP method as the top 5 important variables contributing to the diagnosis of post-LT AKI made by the GBM model. Conclusions: Our GBM-based predictor of post-LT AKI provides a highly interoperable tool across institutions to assist decision-making after LT.
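
The confidence intervals reported for AUC, F1-score, and sensitivity can be obtained in several ways; one common option is bootstrapping, sketched below for AUC (the authors' exact CI method is not stated in the abstract).

# Sketch: percentile-bootstrap confidence interval for AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # skip resamples with a single class
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), lo, hi

# Toy example with placeholder labels and predicted probabilities:
auc, lo, hi = bootstrap_auc_ci([0, 1, 1, 0, 1, 0, 1, 1],
                               [0.2, 0.8, 0.6, 0.4, 0.9, 0.3, 0.7, 0.5])
print(f"AUC {auc:.2f} (95% CI {lo:.2f} to {hi:.2f})")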


Author(s):  
Sujatha R ◽  
Aarthy SL ◽  
Jyotir Moy Chatterjee ◽  
A. Alaboudi ◽  
NZ Jhanjhi

In recent times, Autism Spectrum Disorder (ASD) has been gaining momentum faster than ever. Identifying autism characteristics through screening tests is expensive and time-consuming. Screening is a challenging task, and classification must be conducted with great care. Machine Learning (ML) can perform well on this classification problem. Most researchers have utilized ML strategies to distinguish patients from typical controls, among which support vector machines (SVM) are broadly utilized. Even though several studies have been done utilizing various methods, these investigations did not reach any complete conclusion about predicting autism traits across different age groups. Accordingly, this paper aims to locate the best technique for ASD classification out of SVM, K-Nearest Neighbor (KNN), Random Forest (RF), Naïve Bayes (NB), Stochastic Gradient Descent (SGD), Adaptive Boosting (AdaBoost), and CN2 Rule Induction using 4 ASD datasets taken from the UCI ML repository. The classification accuracy (CA) we acquired after experimentation is as follows: in the adult dataset SGD gives 99.7%, in the adolescent dataset RF gives 97.2%, in the child dataset SGD gives 99.6%, and in the toddler dataset AdaBoost gives 99.8%. Autism spectrum quotients (AQs) varied among several scenarios for toddlers, adults, adolescents, and children, which include positive predictive value for the scaling purpose. AQ questions referred to topics about attention to detail, attention switching, communication, imagination, and social skills.
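
A rough sketch of comparing cross-validated classification accuracy across several screening datasets; the file names, the 'class' column, and the preprocessing are assumptions, and CN2 Rule Induction (an Orange tool rather than a scikit-learn estimator) is omitted.

# Sketch: cross-validated accuracy of several classifiers on ASD screening datasets.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier

datasets = {"adult": "asd_adult.csv", "adolescent": "asd_adolescent.csv",
            "child": "asd_child.csv", "toddler": "asd_toddler.csv"}      # hypothetical files
models = {"SVM": SVC(), "KNN": KNeighborsClassifier(), "RF": RandomForestClassifier(),
          "NB": GaussianNB(), "SGD": SGDClassifier(), "AdaBoost": AdaBoostClassifier()}

for ds_name, path in datasets.items():
    df = pd.read_csv(path)
    X = pd.get_dummies(df.drop(columns=["class"]))       # one-hot encode categorical screening answers
    y = df["class"]
    for m_name, clf in models.items():
        acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
        print(f"{ds_name:10s} {m_name:8s} CA = {acc:.3f}")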


Current global cyber attacks resulting from encryption ransomware have infected systems across countries and businesses, with millions of dollars lost in ransom payments. This type of malware encrypts user files, exfiltrates them, and demands a high ransom in exchange for the decryption keys. Attackers use different types of ransomware to hold a victim's files hostage; examples include scareware, mobile ransomware, WannaCry, CryptoLocker, and zero-day ransomware attacks. A zero-day vulnerability is a software security flaw that is known to the software vendor but does not yet have a patch in place to fix it. Although machine learning algorithms are already used to detect encryption ransomware, this work, based on the analysis of a large number of PE file samples (benign software and ransomware), uses supervised machine learning algorithms to detect zero-day attacks. The work was carried out and evaluated on the Microsoft Windows operating system (the OS most targeted by encryption ransomware). We used four supervised learning algorithms: Random Forest, K-Nearest Neighbor, Support Vector Machine, and Logistic Regression. Tests of these algorithms showed almost no false positives and 99.5% accuracy with the Random Forest algorithm.
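
A minimal sketch of training a random forest on static PE-file features and reporting accuracy alongside the false-positive rate; the file and feature columns are hypothetical, not the paper's feature set.

# Sketch: random forest over static PE-file features with accuracy and FPR.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

pe = pd.read_csv("pe_samples.csv")                       # hypothetical: numeric PE features, label 1 = ransomware
X = pe.drop(columns=["label"])
y = pe["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=7)

rf = RandomForestClassifier(n_estimators=300, random_state=7)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("accuracy: %.3f" % accuracy_score(y_test, pred))
print("false positive rate: %.4f" % (fp / (fp + tn)))    # benign samples flagged as ransomware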

