Topic Categorisation of Statements in Suicide Notes with Integrated Rules and Machine Learning

We describe and evaluate an automated approach used as part of the i2b2 2011 challenge to identify and categorise statements in suicide notes into one of 15 topics, including Love, Guilt, Thankfulness, Hopelessness and Instructions. The approach combines a set of lexico-syntactic rules with a set of models derived by machine learning from a training dataset. The machine learning models rely on named entities, lexical, lexico-semantic and presentation features, as well as the rules that are applicable to a given statement. On a testing set of 300 suicide notes, the approach showed the overall best micro F-measure of up to 53.36%. The best precision achieved was 67.17% when only rules are used, whereas best recall of 50.57% was with integrated rules and machine learning. While some topics (eg, Sorrow, Anger, Blame) prove challenging, the performance for relatively frequent (eg, Love) and well-scoped categories (eg, Thankfulness) was comparatively higher (precision between 68% and 79%), suggesting that automated text mining approaches can be effective in topic categorisation of suicide notes.
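The micro-averaged scores quoted above pool true positives, false positives and false negatives over all topics before computing precision, recall and F-measure. The following is a minimal sketch of that calculation, not the challenge's official scorer. It assumes each statement may carry zero or more topic labels; the function name `micro_prf` is hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1.

    gold, predicted: lists of sets of topic labels, one set per statement
    (assumed layout, chosen for illustration).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels correctly assigned
        fp += len(p - g)   # labels assigned but not in the gold standard
        fn += len(g - p)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [{"Love"}, {"Guilt", "Instructions"}, set()]
    pred = [{"Love"}, {"Guilt"}, {"Hopelessness"}]
    print(micro_prf(gold, pred))
```

Because counts are pooled, frequent topics such as Love dominate the micro average, which is consistent with the abstract's remark that frequent categories score higher.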


Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis

Diagnostics ◽

10.3390/diagnostics12010040 ◽

2021 ◽

Vol 12 (1) ◽

pp. 40

Author(s):

Meike Nauta ◽

Ricky Walsh ◽

Adam Dubowski ◽

Christin Seifert

Keyword(s):

Machine Learning ◽

Clinical Practice ◽

Skin Cancer ◽

Cancer Diagnosis ◽

Image Inpainting ◽

Relevant Information ◽

Black Box ◽

Training Dataset ◽

Learning Models ◽

Machine Learning Models

Machine learning models have been successfully applied for analysis of skin images. However, due to the black box nature of such deep learning models, it is difficult to understand their underlying reasoning. This prevents a human from validating whether the model is right for the right reasons. Spurious correlations and other biases in data can cause a model to base its predictions on such artefacts rather than on the true relevant information. These learned shortcuts can in turn cause incorrect performance estimates and can result in unexpected outcomes when the model is applied in clinical practice. This study presents a method to detect and quantify this shortcut learning in trained classifiers for skin cancer diagnosis, since it is known that dermoscopy images can contain artefacts. Specifically, we train a standard VGG16-based skin cancer classifier on the public ISIC dataset, for which colour calibration charts (elliptical, coloured patches) occur only in benign images and not in malignant ones. Our methodology artificially inserts those patches and uses inpainting to automatically remove patches from images to assess the changes in predictions. We find that our standard classifier partly bases its predictions of benign images on the presence of such a coloured patch. More importantly, by artificially inserting coloured patches into malignant images, we show that shortcut learning results in a significant increase in misdiagnoses, making the classifier unreliable when used in clinical practice. With our results, we, therefore, want to increase awareness of the risks of using black box machine learning models trained on potentially biased datasets. Finally, we present a model-agnostic method to neutralise shortcut learning by removing the bias in the training dataset by exchanging coloured patches with benign skin tissue using image inpainting and re-training the classifier on this de-biased dataset.
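The patch-insertion probe described above can be sketched as follows. This is a toy illustration under strong assumptions, not the authors' pipeline: images are small grayscale grids stored as nested lists, the patch is a bright square rather than an elliptical colour chart, and `toy_classifier` is a hypothetical stand-in for a trained model that has learned the shortcut.

```python
def insert_patch(image, top, left, size, value):
    """Return a copy of a 2-D grayscale image with a square patch pasted in."""
    patched = [row[:] for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            patched[r][c] = value
    return patched


def flip_rate(images, classify, patch_kwargs):
    """Fraction of images predicted malignant that flip to benign
    once the artificial patch is inserted."""
    flipped = total = 0
    for img in images:
        if classify(img) != "malignant":
            continue
        total += 1
        if classify(insert_patch(img, **patch_kwargs)) == "benign":
            flipped += 1
    return flipped / total if total else 0.0


def toy_classifier(image):
    """Hypothetical shortcut-prone model: any very bright pixel means benign."""
    brightest = max(max(row) for row in image)
    return "benign" if brightest >= 250 else "malignant"


if __name__ == "__main__":
    lesions = [[[100] * 8 for _ in range(8)], [[120] * 8 for _ in range(8)]]
    patch = dict(top=0, left=0, size=2, value=255)
    print(flip_rate(lesions, toy_classifier, patch))
```

A high flip rate on images whose lesion content is unchanged indicates that the prediction depends on the artefact rather than on the lesion.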


Assessment of Machine Learning Models to Identify Port Jackson Shark Behaviours Using Tri-Axial Accelerometers

Sensors ◽

10.3390/s20247096 ◽

2020 ◽

Vol 20 (24) ◽

pp. 7096

Author(s):

Julianna P. Kadar ◽

Monique A. Ladds ◽

Joanna Day ◽

Brianne Lyall ◽

Culum Brown

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Classification Tree ◽

Support Vector ◽

Fine Scale ◽

Learning Models ◽

Port Jackson ◽

F Measure ◽

Machine Learning Models ◽

Broad Scale

Movement ecology has traditionally focused on the movements of animals over large time scales, but, with advancements in sensor technology, the focus can become increasingly fine scale. Accelerometers are commonly applied to quantify animal behaviours and can elucidate fine-scale (<2 s) behaviours. Machine learning methods are commonly applied to animal accelerometry data; however, they require the trial of multiple methods to find an ideal solution. We used tri-axial accelerometers (10 Hz) to quantify four behaviours in Port Jackson sharks (Heterodontus portusjacksoni): two fine-scale behaviours (<2 s)—(1) vertical swimming and (2) chewing as proxy for foraging, and two broad-scale behaviours (>2 s–mins)—(3) resting and (4) swimming. We used validated data to calculate 66 summary statistics from tri-axial accelerometry and assessed the most important features that allowed for differentiation between the behaviours. One and two second epoch testing sets were created consisting of 10 and 20 samples from each behaviour event, respectively. We developed eight machine learning models to assess their overall accuracy and behaviour-specific accuracy (one classification tree, five ensemble learners and two neural networks). The support vector machine model classified the four behaviours better when using the longer 2 s time epoch (F-measure 89%; macro-averaged F-measure: 90%). Here, we show that this support vector machine (SVM) model can reliably classify both fine- and broad-scale behaviours in Port Jackson sharks.
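The epoch-based feature extraction mentioned above can be illustrated with a short sketch. The abstract does not list the 66 summary statistics, so the handful computed here (per-axis mean, standard deviation, range, and mean vector magnitude) are illustrative assumptions, as is the function name `epoch_features`. At 10 Hz, a 1 s epoch holds 10 samples and a 2 s epoch holds 20.

```python
import math
from statistics import mean, pstdev


def epoch_features(samples, epoch_len=10):
    """Split (x, y, z) accelerometer samples into non-overlapping epochs
    and compute a few summary statistics per epoch.

    epoch_len: samples per epoch (10 for 1 s at 10 Hz, 20 for 2 s).
    """
    features = []
    for start in range(0, len(samples) - epoch_len + 1, epoch_len):
        epoch = samples[start:start + epoch_len]
        row = {}
        # zip(*epoch) regroups the samples into one tuple per axis
        for name, axis in zip("xyz", zip(*epoch)):
            row[f"mean_{name}"] = mean(axis)
            row[f"sd_{name}"] = pstdev(axis)
            row[f"range_{name}"] = max(axis) - min(axis)
        row["mean_magnitude"] = mean(
            math.sqrt(x * x + y * y + z * z) for x, y, z in epoch
        )
        features.append(row)
    return features


if __name__ == "__main__":
    resting = [(0.0, 0.0, 1.0)] * 20  # a motionless tag reading 1 g on z
    for row in epoch_features(resting, epoch_len=10):
        print(row)
```

Each resulting row would then serve as one labelled training example for a classifier.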


Application of Natural Language Processing with Supervised Machine Learning Techniques to Predict the Overall Drugs Performance

AJIT-e Online Academic Journal of Information Technology ◽

10.5824/ajite.2020.01.001.x ◽

2020 ◽

Vol 11 (40) ◽

pp. 8-23

Author(s):

Pius MARTHIN ◽

Duygu İÇEN

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Semantic Analysis ◽

Classification Tree ◽

Supervised Machine Learning ◽

Training Dataset ◽

Support Vector ◽

Learning Models ◽

Machine Learning Models

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. Given the wealth of information on users' satisfaction and experiences with a particular drug, pharmaceutical companies make use of online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models which facilitate decision making in various fields. In this manuscript we used the drug review dataset of Gräßer, Kallumadi, Malberg, & Zaunseder (2018), freely available from the machine learning repository of the University of California Irvine (UCI), to identify the machine learning model that best predicts overall drug performance from users' reviews. Apart from several manipulations done to improve model accuracy, all necessary procedures required for text analysis were followed, including text cleaning and transformation of texts to a numeric format suitable for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and a word cloud to explore the most frequent terms. Due to scalability issues, we were able to use only a sample of the dataset: we randomly sampled 15000 observations from the 161297 in the training dataset and 10000 observations from the 53766 in the testing dataset. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The trained models include Classification and Regression Trees (CART), a classification tree by C5.0, logistic regression (GLM), Multivariate Adaptive Regression Spline (MARS), support vector machine (SVM) with both radial and linear kernels, and a random forest (Random Forest). Model selection was done through a comparison of accuracies and computational efficiency. The support vector machine (SVM) with a linear kernel performed best, with an accuracy of 83%. Using only a small portion of the dataset, we managed to attain reasonable accuracy in our models by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
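To make the TF-IDF step concrete, here is a minimal sketch using only the standard library. It implements one common variant (relative term frequency times the natural log of inverse document frequency); the abstract does not say which weighting the authors used, so this choice is an assumption, and the function name `tfidf` is hypothetical. The subsequent LSA step (a truncated singular value decomposition of the weighted matrix) is omitted.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document.

    weight = (count / document length) * ln(N / document frequency)
    Tokenisation is a plain lowercase whitespace split, for illustration only.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weighted = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weighted.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


if __name__ == "__main__":
    reviews = [
        "great drug no side effects",
        "bad drug severe side effects",
        "great relief",
    ]
    for row in tfidf(reviews):
        print(row)
```

Terms that occur in every review receive a weight of zero under this variant, so the representation emphasises words that distinguish one review from another.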


Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes (Preprint)

10.2196/preprints.8344 ◽

2017 ◽

Author(s):

Chin Lin ◽

Chia-Jung Hsu ◽

Yu-Sheng Lou ◽

Shih-Jen Yeh ◽

Chia-Cheng Lee ◽

...

Keyword(s):

Machine Learning ◽

Word Embedding ◽

Supervised Machine Learning ◽

Support Vector ◽

Free Text ◽

Learning Models ◽

Diagnosis Codes ◽

Icd 10 ◽

F Measure ◽

Machine Learning Models

BACKGROUND Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN). OBJECTIVE Our objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes in discharge notes. METHODS We used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. We conducted the evaluation using 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017 in the Tri-Service General Hospital in Taipei, Taiwan. We used the receiver operating characteristic curve as an evaluation measure, and calculated the area under the curve (AUC) and F-measure as the global measure of effectiveness. RESULTS In 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). 
Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes. CONCLUSIONS Word embedding combined with a CNN showed outstanding performance compared with traditional methods, needing very little data preprocessing. This shows that future studies will not be limited by incomplete dictionaries. A large amount of unstructured information from free-text medical writing will be extracted by automated approaches in the future, and we believe that the health care field is about to enter the age of big data.
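The two evaluation measures reported above can be computed for a single binary label as sketched below. This is an illustrative stdlib implementation, not the authors' evaluation code. It assumes labels are 0 or 1 and scores are predicted probabilities, and the 0.5 decision threshold for the F-measure is an assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking:
    the probability that a random positive outscores a random negative,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def f_measure(labels, scores, threshold=0.5):
    """F1 score after thresholding the scores (threshold is an assumption)."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


if __name__ == "__main__":
    y = [1, 1, 0, 0]
    s = [0.9, 0.4, 0.5, 0.1]
    print(auc(y, s), f_measure(y, s))
```

For a multi-class task such as chapter-level code assignment, one plausible reading of the reported means is that these per-label values are averaged over labels, but the abstract does not spell this out.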


The Adequacy Assessment of Test Sets in Machine Learning using Mutation Testing

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1183.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 4390-4395

Keyword(s):

Machine Learning ◽

Research Question ◽

Training Dataset ◽

Data Repository ◽

Learning Models ◽

Data Set ◽

Test Dataset ◽

Test Sets ◽

Adequacy Assessment ◽

Machine Learning Models

The accuracy of a machine learning model is computed by applying the test dataset to the model trained on the training dataset. Thus, the test dataset is expected to validate whether a trained model is sufficiently accurate for use. This study addresses this issue in the form of the research question, “how adequate is the test dataset used in machine learning models to validate the models.” To answer this question, the study takes the seven most popular datasets registered in the UCI machine learning data repository and applies them to six different machine learning models. We conduct an empirical study of how adequate the test sets used to validate machine learning models are. The testing adequacy for each model and each dataset is analyzed using mutation analysis.
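The idea of judging a test set by mutation analysis can be sketched as follows. The abstract does not describe the mutation operators or the kill criterion used in the study, so everything here is an assumption made for illustration: models are one-dimensional threshold classifiers, mutants shift the threshold, and a mutant counts as killed when at least one test input is classified differently from the original.

```python
def mutation_score(original, mutants, test_inputs):
    """Fraction of mutant models that the test inputs can tell apart
    from the original model (assumed kill criterion)."""
    killed = 0
    for mutant in mutants:
        if any(mutant(x) != original(x) for x in test_inputs):
            killed += 1
    return killed / len(mutants) if mutants else 0.0


def make_threshold_model(threshold):
    """Hypothetical toy classifier: class 1 at or above the threshold."""
    return lambda x: int(x >= threshold)


if __name__ == "__main__":
    original = make_threshold_model(5.0)
    mutants = [make_threshold_model(t) for t in (4.0, 4.9, 5.1, 6.0)]
    sparse_tests = [1.0, 9.0]                        # far from the boundary
    dense_tests = [1.0, 4.5, 4.95, 5.05, 5.5, 9.0]   # probes the boundary
    print(mutation_score(original, mutants, sparse_tests))
    print(mutation_score(original, mutants, dense_tests))
```

A test set that leaves many mutants alive cannot distinguish the trained model from slightly faulty variants, which is the sense in which it would be judged inadequate.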


Detecting Plasma Detachment in the Wendelstein 7-X Stellarator Using Machine Learning

Applied Sciences ◽

10.3390/app12010269 ◽

2021 ◽

Vol 12 (1) ◽

pp. 269

Author(s):

Máté Szűcs ◽

Tamás Szepesi ◽

Christoph Biedermann ◽

Gábor Cseh ◽

Marcin Jakubowski ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Supervised Machine Learning ◽

Training Dataset ◽

Computational Time ◽

Learning Models ◽

Camera System ◽

Pixel Intensity ◽

On The Road ◽

Machine Learning Models

The detachment regime has a high potential to play an important role in fusion devices on the road to a fusion power plant. Complete power detachment has been observed several times during the experimental campaigns of the Wendelstein 7-X (W7-X) stellarator. Automatic observation and signaling of such events could help scientists to better understand these phenomena. With the growing discharge times in fusion devices, machine learning models and algorithms are a powerful tool to process the increasing amount of data. We investigate several classical supervised machine learning models to detect complete power detachment in the images captured by the Event Detection Intelligent Camera System (EDICAM) at the W7-X at each given image frame. In the dedicated detached state the plasma is stable despite its reduced contact with the machine walls and the radiation belt stays close to the separatrix, without exhibiting significant heat load onto the divertor. To decrease computational time and resources needed we propose certain pixel intensity profiles (or intensity values along lines) as the input to these models. After finding the profile that describes the images best in terms of detachment, we choose the best performing machine learning algorithm. It achieves an F1 score of 0.9836 on the training dataset and 0.9335 on the test set. Furthermore, we investigate its predictions in other scenarios, such as plasmas with substantially decreased minor radius and several magnetic configurations.
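Using intensity values along a line as model input, instead of the full frame, can be sketched as below. This is an illustrative stdlib version under simple assumptions: the frame is a 2-D grid of intensities stored as nested lists, sampling uses the nearest pixel, and the function name `line_profile` and the line endpoints are hypothetical. The abstract does not state which lines were actually used on the EDICAM images.

```python
def line_profile(image, start, end, n_points):
    """Sample pixel intensities at n_points evenly spaced positions on the
    segment from start to end, both given as (row, col) pairs.

    Coordinates are assumed non-negative and inside the image.
    """
    (r0, c0), (r1, c1) = start, end
    profile = []
    for i in range(n_points):
        t = i / (n_points - 1) if n_points > 1 else 0.0
        # int(v + 0.5) rounds a non-negative value to the nearest pixel index
        r = int(r0 + t * (r1 - r0) + 0.5)
        c = int(c0 + t * (c1 - c0) + 0.5)
        profile.append(image[r][c])
    return profile


if __name__ == "__main__":
    frame = [[r * 10 + c for c in range(4)] for r in range(4)]
    print(line_profile(frame, (0, 0), (3, 3), 4))  # diagonal profile
    print(line_profile(frame, (2, 0), (2, 3), 4))  # horizontal profile
```

A profile of a few dozen values per frame is a far smaller input than the full image, which is the source of the computational saving the abstract describes.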


Detection of Breast Cancer Using Machine Learning Algorithms

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit217141 ◽

2021 ◽

pp. 223-227

Author(s):

Vijaylaxmi Kochari

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Linear Regression ◽

Decision Tree ◽

Machine Learning Algorithms ◽

Training Dataset ◽

Proper Treatment ◽

Learning Models ◽

Initial Stage ◽

Machine Learning Models

Breast cancer is one of the dangerous diseases that cause a high number of deaths every year. A dataset of features in CSV format is used to identify whether the digitised image is benign or malignant. Machine learning models such as Linear Regression, Decision Tree and Random Forest are trained on the training dataset and used for classification. The accuracy of these classifiers is compared to select the best model. This will help doctors to give proper treatment at the initial stage and save patients' lives.


SELM: Software Engineering of Machine Learning Models

10.3233/faia210007 ◽

2021 ◽

Author(s):

Nafiseh Jafari ◽

Mohammad Reza Besharati ◽

Maryam Hourali

Keyword(s):

Machine Learning ◽

Software Engineering ◽

Interdisciplinary Approach ◽

Interdisciplinary Teams ◽

Process Efficiency ◽

Training Dataset ◽

Learning Models ◽

Machine Learning Model ◽

Machine Learning Models

One of the pillars of any machine learning model is its concepts. Using software engineering, we can engineer these concepts and then develop and expand them. In this article, we present the SELM framework for Software Engineering of machine Learning Models. We then evaluate this framework through a case study. Using the SELM framework, we can improve the efficiency of a machine learning process and achieve higher accuracy with fewer hardware resources and a smaller training dataset. This highlights the importance of an interdisciplinary approach to machine learning. Therefore, in this article, we also provide proposals for interdisciplinary teams working on machine learning.


The use of machine learning methods in the development of nasal dosage forms with cerebroprotective action

Current issues in pharmacy and medicine science and practice ◽

10.14739/2409-2932.2021.2.232053 ◽

2021 ◽

Vol 14 (2) ◽

pp. 232-238

Author(s):

B. S. Burlaka ◽

I. F. Bielenichev

Keyword(s):

Machine Learning ◽

In Silico ◽

High Reliability ◽

Binary Classification ◽

Dosage Forms ◽

Training Dataset ◽

Learning Models ◽

Pharmaceutical Ingredients ◽

Machine Learning Models ◽

Rational Composition

To save active pharmaceutical ingredients and excipients in the early stages of research, it is advisable when planning an experiment to use the predicted and experimental physicochemical properties stored in various aggregation databases. The information found will reduce the time needed for composition development and for technology processing. However, the variety of characteristics of active compounds and excipients is not always reflected in these services. Recently, machine learning models have been widely used in various scientific fields; they make it possible to obtain predictions with high reliability. Given the above, it is relevant and promising to develop machine learning models to predict the presence of pharmaceutical incompatibilities in the formulation of nasal dosage forms. The aim of the study is to develop machine learning models for in silico prediction of the rational composition of nasal dosage forms with cerebroprotective action. Materials and methods. A dataset containing data on compounds (active and auxiliary) and on the presence or absence of interaction (pharmaceutical incompatibility) was used as material. Training datasets were filled manually by content analysis of PubMed library data (pubmed.ncbi.nlm.nih.gov) for the last 10 years, using the keywords “pharmaceutical incompatibilities”, “physico-chemical compatibility” and “incompatible excipients”. The resulting dataset comprises 1185 lines. The methods employed were a set of machine learning methods for binary classification (pycaret.org) using the programming language Python 3.8 (python.org) in the package management environment Miniconda (conda.io). Pipeline programming was performed using the Jupyter notebook package (jupyter.org). MACCS (Molecular ACCess System) keys in the training dataset were generated using the RDKit package (rdkit.org). Simplified molecular-input line-entry system (SMILES) representations of the molecules were retrieved automatically using the PubChem service (pubchem.ncbi.nlm.nih.gov). Results. The obtained data allowed us to choose two promising binary classification models, whose quality was checked on a verification dataset. Statistical evaluations of the selected models indicate a high probability of correct in silico prediction of the presence or absence of pharmaceutical incompatibilities in the development of nasal formulations of cerebroprotective dosage forms. The models are posted on the web server of the expert system ExpSys Nasalia (nasalia.zsmu.zp.ua) in the calculations section. Conclusions. As a result of our research, we have developed machine learning models for in silico prediction of the rational composition of nasal dosage forms with cerebroprotective action. The quality of the pharmaceutical incompatibility predictions made by the developed models was confirmed on a verification dataset. The statistical indicators of the tree_blender (AUC 0.9521, F1 0.9747, MCC 0.9094) and boost_blender (AUC 0.9593, F1 0.9821, MCC 0.9352) models were obtained. The use of machine learning models in pharmaceutical development will contribute to resource conservation and optimization of the composition of the formulation.
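Of the three indicators reported above, the Matthews correlation coefficient (MCC) is the least familiar, so a minimal sketch of how it is derived from a binary confusion matrix may help. The function name `mcc` and the example counts are hypothetical and are not taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction). Returns 0.0 when the denominator
    is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative counts only: 45 incompatible pairs found, 5 missed,
    # 40 compatible pairs confirmed, 10 falsely flagged.
    print(round(mcc(tp=45, tn=40, fp=10, fn=5), 4))
```

Unlike F1, MCC uses all four cells of the confusion matrix, so it stays informative when the incompatible and compatible classes are unevenly represented.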


Direct Comparison of the Prediction of the Unbound Brain-to-Plasma Partitioning Utilizing Machine Learning Approach and Mechanistic Neuropharmacokinetic Model

The AAPS Journal ◽

10.1208/s12248-021-00604-x ◽

2021 ◽

Vol 23 (4) ◽

Author(s):

Yohei Kosugi ◽

Kunihiko Mizuno ◽

Cipriano Santos ◽

Sho Sato ◽

Natalie Hosea ◽

...

Keyword(s):

Machine Learning ◽

Multiple Drug Resistance ◽

Predictive Performance ◽

Training Dataset ◽

Multiple Drug ◽

Learning Approach ◽

Cancer Resistance ◽

Learning Models ◽

Machine Learning Approach ◽

Machine Learning Models

The mechanistic neuropharmacokinetic (neuroPK) model was established to predict unbound brain-to-plasma partitioning (Kp,uu,brain) by considering in vitro efflux activities of multiple drug resistance 1 (MDR1) and breast cancer resistance protein (BCRP). Herein, we directly compare this model to a computational machine learning approach utilizing physicochemical descriptors and efflux ratios of MDR1 and BCRP-expressing cells for predicting Kp,uu,brain in rats. Two different types of machine learning techniques, Gaussian processes (GP) and random forest regression (RF), were assessed by the time-split and cluster-split validation methods using 640 internal compounds. The predictivity of machine learning models based on only molecular descriptors was worse in the time-split dataset than in the cluster-split dataset, whereas the models incorporating MDR1 and BCRP efflux ratios showed similar predictivity between time-split and cluster-split datasets. The GP incorporating MDR1 and BCRP in the time-split dataset achieved the highest correlation (R² = 0.602). These results suggested that incorporation of MDR1 and BCRP in machine learning is beneficial for robust and accurate prediction. Kp,uu,brain prediction utilizing the neuroPK model was significantly worse compared to machine learning approaches for the same dataset. We also investigated the predictivity of Kp,uu,brain using an external independent test set of 34 marketed drugs. Compared to machine learning models, the neuroPK model showed better predictive performance with R² of 0.577. This work demonstrates that the machine learning model for Kp,uu,brain achieves maximum predictive performance within the chemical applicability domain, whereas the neuroPK model is applicable more widely beyond the chemical space covered in the training dataset.
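A time-split validation and the correlation-based score reported above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' protocol: each compound record is a dict with an ISO-format `date` string (a hypothetical field), and R² is taken to be the squared Pearson correlation between observed and predicted values, which is one plausible reading of "correlation (R² = 0.602)".

```python
import math


def time_split(records, cutoff):
    """Train on compounds registered before the cutoff date, test on the rest.

    Dates are ISO strings (YYYY-MM-DD), which compare correctly as text.
    """
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0
    return (cov / math.sqrt(var_o * var_p)) ** 2


if __name__ == "__main__":
    compounds = [
        {"id": "A", "date": "2018-03-01"},
        {"id": "B", "date": "2019-07-15"},
        {"id": "C", "date": "2020-01-10"},
    ]
    train, test = time_split(compounds, cutoff="2019-01-01")
    print([c["id"] for c in train], [c["id"] for c in test])
    print(r_squared([1, 2, 3, 4], [1, 3, 2, 4]))
```

A time split mimics prospective use, because the model is evaluated only on compounds made after those it was trained on, whereas a cluster split holds out whole groups of structurally similar compounds.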
