Clinical oncology textual notes analysis using machine learning and deep learning (Preprint)

2021 ◽  
Author(s):  
Diego Pinheiro da Silva ◽  
Marco A. Schwertner ◽  
Sandro José Rigo

UNSTRUCTURED Text analysis and classification are important research topics, as advances in this field can improve the quality of existing clinical systems. Our research experimentally explored text classification methods applied to non-synthetic corpora of oncology clinical notes. The experiments were performed on a dataset of 3,308 medical notes. They evaluated the following machine learning and deep learning classification methods: Multilayer Perceptron neural network, Logistic Regression, Decision Tree classifier, Random Forest classifier, K-nearest neighbors classifier, and Long Short-Term Memory. One experiment evaluated the influence of the corpus preprocessing step on the results, showing that the classifier’s mean accuracy rose from 26.1% to 86.7% with the per-clinical-event corpus and to 93.9% with the per-patient corpus. The best-performing classifier was the Multilayer Perceptron, which achieved 93.90% accuracy, a Macro F1 score of 93.61%, and a Weighted F1 score of 93.99%.
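
The three metrics reported above differ only in how per-class results are averaged. The following is a minimal sketch, assuming plain Python lists of true and predicted labels; the function name `f1_scores` and the toy labels in the usage note are illustrative and are not taken from the paper.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, weighted_f1) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class

    per_class_f1 = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class_f1[label] = 2 * tp / denom if denom else 0.0

    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # Macro F1: unweighted mean over classes, so rare classes count equally.
    macro_f1 = sum(per_class_f1.values()) / len(labels)
    # Weighted F1: mean over classes weighted by each class's true frequency.
    weighted_f1 = sum(per_class_f1[c] * support[c] for c in labels) / len(y_true)
    return accuracy, macro_f1, weighted_f1
```

For example, `f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])` gives an accuracy of 0.75, while the macro and weighted F1 scores differ because class "a" is three times as frequent as class "b".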

Human activity recognition is an active challenge with applications in numerous fields, such as medical health care, military, manufacturing, assistive techniques, and gaming. Owing to advances in technology, the use of smartphones in daily life has become inevitable. The sensors in smartphones help us measure essential vital parameters. These measured parameters enable us to monitor human activities, a task known as human activity recognition. In this paper, we propose an automatic human activity recognition system that independently recognizes human actions. Four deep learning approaches and thirteen different machine learning classifiers, such as Multilayer Perceptron, Random Forest, Support Vector Machine, Decision Tree Classifier, AdaBoost Classifier, Gradient Boosting Classifier, and others, are applied to identify an efficient classifier for human activity recognition. The proposed system is able to recognize activities such as Laying, Sitting, Standing, Walking, Walking downstairs, and Walking upstairs. A benchmark dataset was used to evaluate all the implemented classifiers. We investigated all these classifiers to identify the most suitable classifier for this dataset. The results show that the Multilayer Perceptron obtained an overall accuracy of 98.46% in detecting the activities. The second-best performance was observed when the classifiers were combined.


This work derives methodologies to detect heart issues at an earlier stage and to inform patients so that they can improve their health. To address this problem, we use machine learning techniques to predict the incidence at an earlier stage. For prediction, we use certain parameters such as age, sex, height, weight, case history, smoking, and alcohol consumption, and tests such as blood pressure, cholesterol, diabetes, ECG, and ECHO. Many machine learning algorithms can be used to solve this problem. The algorithms include K-Nearest Neighbour, Support Vector classifier, Decision Tree classifier, Logistic Regression, and Random Forest classifier. Using these parameters and algorithms, we predict whether or not the patient has heart disease and recommend that the patient improve his/her health.


Online discussion forums and blogs are vibrant platforms where cancer patients express their views in the form of stories. These stories sometimes become a source of inspiration for patients who are anxiously searching for similar cases. This paper proposes a method using natural language processing and machine learning to analyze unstructured texts accumulated from patients’ reviews and stories. The proposed methodology aims to identify the behavior, emotions, side effects, decisions, and demographics associated with cancer victims. The pre-processing phase of our work involves extracting web text, followed by text cleaning, in which some special characters and symbols are omitted, and finally tagging the texts using the NLTK (Natural Language Toolkit) POS (part-of-speech) tagger. The post-processing phase trains seven machine learning classifiers (refer to Table 6). The Decision Tree classifier shows the highest precision (0.83) among the classifiers, while the area under the receiver operating characteristic curve (AUC) is highest for the Support Vector Machine (SVM) classifier (0.98).


2021 ◽  
Author(s):  
Son Hoang ◽  
Tung Tran ◽  
Tan Nguyen ◽  
Tu Truong ◽  
Duy Pham ◽  
...  

Abstract This paper reports a successful case study of applying machine learning to improve the history matching process, making it easier, less time-consuming, and more accurate, by determining whether Local Grid Refinement (LGR) with transmissibility multiplier is needed to history match gas-condensate wells producing from geologically complex reservoirs as well as determining the required LGR setup to history match those gas-condensate producers. History matching Hai Thach gas-condensate production wells is extremely challenging due to the combined effect of condensate banking, sub-seismic fault network, complex reservoir distribution and connectivity, uncertain HIIP, and lack of PVT data for most reservoirs. In fact, for some wells, many trial simulation runs were conducted before it became clear that LGR with transmissibility multiplier was required to obtain good history matching. In order to minimize this time-consuming trial-and-error process, machine learning was applied in this study to analyze production data using synthetic samples generated by a very large number of compositional sector models so that the need for LGR could be identified before the history matching process begins. Furthermore, machine learning application could also determine the required LGR setup. The method helped provide better models in a much shorter time, and greatly improved the efficiency and reliability of the dynamic modeling process. More than 500 synthetic samples were generated using compositional sector models and divided into separate training and test sets. Multiple classification algorithms such as logistic regression, Gaussian Naive Bayes, Bernoulli Naive Bayes, multinomial Naive Bayes, linear discriminant analysis, support vector machine, K-nearest neighbors, and Decision Tree as well as artificial neural networks were applied to predict whether LGR was used in the sector models. 
The best algorithm was found to be the Decision Tree classifier, with 100% accuracy on the training set and 99% accuracy on the test set. The LGR setup (size of LGR area and range of transmissibility multiplier) was also predicted best by the Decision Tree classifier with 91% accuracy on the training set and 88% accuracy on the test set. The machine learning model was validated using actual production data and the dynamic models of history-matched wells. Finally, using the machine learning prediction on wells with poor history matching results, their dynamic models were updated and significantly improved.


2021 ◽  
pp. 1-11
Author(s):  
Jesús Miguel García-Gorrostieta ◽  
Aurelio López-López ◽  
Samuel González-López ◽  
Adrián Pastor López-Monroy

Academic thesis writing is a complex task that requires the author to be skilled in argumentation. The goal of the academic author is to communicate clear ideas and to convince the reader of the presented claims. However, few students are good arguers, and this is a skill that takes time to master. In this paper, we present an exploration of lexical features used to model automatic detection of argumentative paragraphs using machine learning techniques. We present a novel proposal, which combines the information in the complete paragraph with the detection of argumentative segments in order to achieve improved results for the detection of argumentative paragraphs. We propose two approaches: a more descriptive one, which uses a decision tree classifier with indicators and lexical features, and a more efficient one, which uses an SVM classifier with lexical features and a Document Occurrence Representation (DOR). Both approaches consider the detection of argumentative segments to ensure that a paragraph detected as argumentative does indeed contain segments with argumentation. We achieved encouraging results for both approaches.


Water ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 2927
Author(s):  
Jiyeong Hong ◽  
Seoro Lee ◽  
Joo Hyun Bae ◽  
Jimin Lee ◽  
Woon Ji Park ◽  
...  

Predicting dam inflow is necessary for effective water management. This study developed machine learning models to predict the inflow into the Soyang River Dam in South Korea, using 40 years of weather and dam inflow data. Six algorithms were used: decision tree (DT), multilayer perceptron (MLP), random forest (RF), gradient boosting (GB), recurrent neural network–long short-term memory (RNN–LSTM), and convolutional neural network–LSTM (CNN–LSTM). Among these models, the multilayer perceptron showed the best results in predicting dam inflow, with a Nash–Sutcliffe efficiency (NSE) of 0.812, root mean squared error (RMSE) of 77.218 m³/s, mean absolute error (MAE) of 29.034 m³/s, correlation coefficient (R) of 0.924, and coefficient of determination (R²) of 0.817. However, when the dam inflow is below 100 m³/s, the ensemble models (random forest and gradient boosting) performed better than the MLP. Therefore, two combined machine learning (CombML) models (RF_MLP and GB_MLP) were developed for the prediction of the dam inflow, using the ensemble methods (RF and GB) at precipitation below 16 mm and the MLP at precipitation above 16 mm. The precipitation of 16 mm is the average daily precipitation at the inflow of 100 m³/s or more. Accuracy verification gave NSE 0.857, RMSE 68.417 m³/s, MAE 18.063 m³/s, R 0.927, and R² 0.859 for RF_MLP, and NSE 0.829, RMSE 73.918 m³/s, MAE 18.093 m³/s, R 0.912, and R² 0.831 for GB_MLP, which implies that the combined models predict the dam inflow most accurately. The CombML algorithms showed that it is possible to predict inflow through inflow learning, considering flow characteristics such as flow regimes, by combining several machine learning algorithms.
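
The error measures and the precipitation-based switch described above can be sketched briefly. This is an illustration under stated assumptions, not the authors' code: the models are passed in as hypothetical callables, the function names are invented here, and the behavior at exactly 16 mm (routed to the MLP) is an assumption because the abstract only specifies "below" and "above".

```python
import math


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def rmse(observed, simulated):
    """Root mean squared error, in the units of the inputs (e.g. m^3/s)."""
    squared = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return math.sqrt(squared / len(observed))


def mae(observed, simulated):
    """Mean absolute error, in the units of the inputs."""
    return sum(abs(o - s) for o, s in zip(observed, simulated)) / len(observed)


PRECIP_THRESHOLD_MM = 16.0  # switching threshold stated in the abstract


def combined_predict(precip_mm, ensemble_model, mlp_model, features):
    """Use the ensemble model (RF or GB) below the threshold, else the MLP.

    Both models are hypothetical callables that map features to an inflow.
    Routing precipitation of exactly 16 mm to the MLP is an assumption.
    """
    model = ensemble_model if precip_mm < PRECIP_THRESHOLD_MM else mlp_model
    return model(features)
```

An NSE of 1 indicates a perfect match, and an NSE of 0 means the model is no better than always predicting the mean of the observations.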


2019 ◽  
Vol 9 (11) ◽  
pp. 2375 ◽  
Author(s):  
Riaz Ullah Khan ◽  
Xiaosong Zhang ◽  
Rajesh Kumar ◽  
Abubakar Sharif ◽  
Noorbakhsh Amiri Golilarz ◽  
...  

In recent years, botnets have been the most common threats to network security, since they exploit multiple types of malicious code such as worms, Trojans, and rootkits. Botnets have been used to carry phishing links, to perform attacks, and to provide malicious services on the internet. Peer-to-peer (P2P) botnets are more challenging to identify than Internet Relay Chat (IRC), Hypertext Transfer Protocol (HTTP), and other types of botnets because P2P traffic has typical features of both centralization and distribution. To resolve the issues of P2P botnet identification, we propose an effective multi-layer traffic classification method that applies machine learning classifiers to features of network traffic. Our work presents a framework based on decision trees that effectively detects P2P botnets. A decision tree algorithm is applied for feature selection to extract the most relevant features and ignore the irrelevant ones. At the first layer, we filter non-P2P packets to reduce the amount of network traffic, using well-known ports, Domain Name System (DNS) queries, and flow counting. The second layer further separates the captured network traffic into non-P2P and P2P. At the third layer of our model, we remove the features that only marginally affect the classification. At the final layer, we detect P2P botnets using a decision tree classifier on extracted network communication features. Furthermore, our experimental evaluations show the significance of the proposed method in P2P botnet detection and demonstrate an average accuracy of 98.7%.
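
The first filtering layer described above can be illustrated with a small sketch. The abstract does not give the actual rules, so everything concrete here is an assumption: the `Packet` record, treating ports 0 to 1023 as "well-known", and the hypothetical `MIN_PACKETS_PER_FLOW` threshold for flow counting. The DNS-query rule is left out because the abstract does not say how it is applied.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record; real captures carry more fields."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


WELL_KNOWN_PORT_MAX = 1023  # assumption: system ports 0-1023 count as well-known
MIN_PACKETS_PER_FLOW = 2    # hypothetical flow-counting threshold


def flow_key(pkt):
    """Direction-independent identifier for the flow a packet belongs to."""
    endpoints = sorted([(pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)])
    return tuple(endpoints)


def first_layer_filter(packets):
    """Keep only packets that remain P2P candidates after coarse filtering."""
    # Step 1: discard packets that use a well-known port on either side.
    candidates = [
        p for p in packets
        if p.src_port > WELL_KNOWN_PORT_MAX and p.dst_port > WELL_KNOWN_PORT_MAX
    ]
    # Step 2: count packets per flow and drop flows that are too small.
    counts = Counter(flow_key(p) for p in candidates)
    return [p for p in candidates if counts[flow_key(p)] >= MIN_PACKETS_PER_FLOW]
```

A coarse filter of this kind only reduces the volume of traffic; the separation of P2P from non-P2P flows and the botnet detection itself belong to the later layers.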


2020 ◽  
Vol 7 ◽  
Author(s):  
Seung Hoon Yoo ◽  
Hui Geng ◽  
Tin Lok Chiu ◽  
Siu Ki Yu ◽  
Dae Chul Cho ◽  
...  

2020 ◽  
Vol 10 (11) ◽  
pp. 851
Author(s):  
Vincent Chin-Hung Chen ◽  
Tung-Yeh Lin ◽  
Dah-Cherng Yeh ◽  
Jyh-Wen Chai ◽  
Jun-Cheng Weng

Breast cancer is the leading cancer among women worldwide, and a high number of breast cancer patients are struggling with psychological and cognitive disorders. In this study, we aim to use machine learning models to discriminate between chemo-brain participants and healthy controls (HCs) using connectomes (connectivity matrices) and topological coefficients. Nineteen female post-chemotherapy breast cancer (BC) survivors and 20 female HCs were recruited for this study. Participants in both groups received resting-state functional magnetic resonance imaging (rs-fMRI) and generalized q-sampling imaging (GQI). Logistic regression (LR), decision tree classifier (CART), and xgboost (XGB) were the models we adopted for classification. In connectome analysis, LR achieved an accuracy of 79.49% with the functional connectomes and an accuracy of 71.05% with the structural connectomes. In the topological coefficient analysis, accuracies of 87.18%, 82.05%, and 83.78% were obtained by the functional global efficiency with CART, the functional global efficiency with XGB, and the structural transitivity with CART, respectively. The areas under the curves (AUCs) were 0.93, 0.94, 0.87, 0.88, and 0.84, respectively. Our study showed the discriminating ability of functional connectomes, structural connectomes, and global efficiency. We hope our findings can contribute to an understanding of the chemo brain and the establishment of a clinical system for tracking chemo brain.


Computers ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 4 ◽  
Author(s):  
Jurgita Kapočiūtė-Dzikienė ◽  
Robertas Damaševičius ◽  
Marcin Woźniak

We describe the sentiment analysis experiments that were performed on the Lithuanian Internet comment dataset using traditional machine learning approaches, namely Naïve Bayes Multinomial (NBM) and Support Vector Machine (SVM), and deep learning approaches, namely Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). The traditional machine learning techniques were used with features based on lexical, morphological, and character information. The deep learning approaches were applied on top of two types of word embeddings (Word2Vec continuous bag-of-words with negative sampling, and FastText). Both traditional and deep learning approaches had to solve the positive/negative/neutral sentiment classification task on the balanced and full dataset versions. The best deep learning results (an accuracy of 0.706) were achieved on the full dataset with CNN applied on top of the FastText embeddings, with replaced emoticons and eliminated diacritics. The traditional machine learning approaches demonstrated the best performance (an accuracy of 0.735) on the full dataset with the NBM method, with replaced emoticons, restored diacritics, and lemma unigrams as features. Although the traditional machine learning approaches were superior to the deep learning methods, deep learning demonstrated good results when applied to the small datasets.
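
One of the preprocessing steps mentioned above, eliminating diacritics, can be sketched with the Python standard library. This is an illustration only, not the authors' pipeline: the function name is invented here, and the approach assumes that every accented letter has a canonical Unicode decomposition, which holds for the Lithuanian letters ą, č, ę, ė, į, š, ų, ū, and ž.

```python
import unicodedata


def eliminate_diacritics(text):
    """Replace accented letters with their base letters (e.g. 'č' -> 'c')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero canonical combining class; drop them.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)
```

Restoring diacritics, the opposite step used in the best traditional configuration, is a much harder task because it needs a language model or dictionary, and it is not attempted in this sketch.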

