SALTClass: classifying clinical short notes using background knowledge from unlabeled data

2019 ◽  
Author(s):  
Ayoub Bagheri ◽  
Daniel Oberski ◽  
Arjan Sammani ◽  
Peter G.M. van der Heijden ◽  
Folkert W. Asselbergs

Abstract
Background: With the increasing use of unstructured text in electronic health records, extracting useful information from it has become a necessity. Text classification can be applied to extract patients' medical history from clinical notes. However, the sparsity of clinical short notes, that is, excessively small word counts in the text, can lead to large classification errors. Previous studies demonstrated that natural language processing (NLP) can be useful in the text classification of clinical outcomes. We propose incorporating knowledge from unlabeled data, as this may alleviate the problem of short, noisy, sparse text.
Results: The software package SALTClass (short and long text classifier) is a machine learning NLP toolkit. It uses seven clustering algorithms, namely latent Dirichlet allocation, K-Means, MiniBatch K-Means, BIRCH, MeanShift, DBSCAN, and Gaussian mixture models (GMM). Smoothing methods are applied to the resulting cluster information to enrich the representation of sparse text. For the subsequent prediction step, SALTClass can be used either on the original document-term matrix or in an enrichment pipeline. To this end, ten different supervised classifiers have also been integrated into SALTClass. We demonstrate the effectiveness of the SALTClass NLP toolkit in the identification of patients' family history in a Dutch clinical cardiovascular text corpus from University Medical Center Utrecht, the Netherlands.
Conclusions: The considerable amount of unstructured short text in healthcare applications, particularly in clinical cardiovascular notes, has created an urgent need for tools that can parse specific information from text reports. Using machine learning algorithms to enrich short text can improve its representation for further applications.
Availability: SALTClass can be downloaded as a Python package from the Python Package Index (PyPI) at https://pypi.org/project/saltclass and from GitHub at https://github.com/bagheria/saltclass.
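
As a rough illustration of the enrichment idea described above, the sketch below builds a sparse document-term matrix for a few toy short notes, appends K-Means cluster-distance features, and trains a classifier on the enriched representation. It uses scikit-learn directly rather than the SALTClass API, and the notes and labels are invented placeholders.

    # Cluster-based enrichment of a sparse document-term matrix (illustration only)
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    notes = ["pt has fam hx of cad",
             "no family history reported",
             "father had mi at age 50",
             "no relevant family history"]
    labels = [1, 0, 1, 0]   # 1 = family history present (toy labels)

    # Sparse document-term representation of the short notes
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(notes)

    # Cluster the notes and use distances to the cluster centroids as extra,
    # denser features that enrich the sparse matrix
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    cluster_features = kmeans.fit_transform(X)
    X_enriched = np.hstack([X.toarray(), cluster_features])

    # Train a supervised classifier on the enriched representation
    clf = LogisticRegression().fit(X_enriched, labels)
    print(clf.predict(X_enriched))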


2021 ◽  
Author(s):  
Tanya Nijhawan ◽  
Girija Attigeri ◽  
Ananthakrishna T

Abstract Cyberspace is a vast soapbox where people post anything they witness in their day-to-day lives. Consequently, it can be a very effective tool for detecting an individual's stress level based on the posts and comments they share on social networking platforms. We leverage large-scale tweet datasets to perform sentiment analysis with the aid of machine learning algorithms. We use BERT, a capable pre-trained deep learning model, to address the challenges of sentiment classification; BERT outperforms many other well-known models for this task without requiring any sophisticated architecture. We also adopt Latent Dirichlet Allocation, an unsupervised machine learning method that scans a group of documents, recognizes the word and phrase patterns within them, and gathers the word groups and similar expressions that best characterize the set of documents. This helps us predict which topic is linked to the textual data. With the aid of the proposed models, we are able to detect the emotions of users online. We work primarily with Twitter data because Twitter is a platform where people frequently express their thoughts. In conclusion, this proposal is aimed at the well-being of one's mental health. The results are evaluated using various metrics at the macro and micro levels and indicate that the trained model detects users' emotional state based on their social interactions.
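
As a minimal sketch of the two components described above, and not the authors' code, the snippet below runs a pre-trained BERT-based sentiment pipeline from Hugging Face Transformers and fits a small gensim LDA model on a few invented tweets; the default pipeline model and the toy data are assumptions.

    # BERT-based sentiment scoring plus LDA topic modelling on toy tweets
    from transformers import pipeline
    from gensim import corpora
    from gensim.models import LdaModel

    tweets = [
        "deadlines everywhere, cannot sleep, completely burnt out",
        "had a great walk today, feeling calm and happy",
        "exams and work at the same time, so much pressure",
    ]

    # Pre-trained BERT-style sentiment pipeline (downloads a default model)
    sentiment = pipeline("sentiment-analysis")
    print(sentiment(tweets))

    # LDA to surface the word groups (topics) behind the same tweets
    tokenised = [t.lower().split() for t in tweets]
    dictionary = corpora.Dictionary(tokenised)
    corpus = [dictionary.doc2bow(doc) for doc in tokenised]
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
    print(lda.print_topics())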



2021 ◽  
Author(s):  
Vrushang Patel

Text classification is a classical machine learning application in natural language processing that aims to assign labels to textual units such as documents, sentences, paragraphs, and queries. Applications of text classification include sentiment classification and news categorization. Sentiment classification identifies the polarity of text, such as positive, negative, or neutral, based on textual features. In this thesis, we implemented a modified form of a tolerance-based algorithm (TSC) to classify the sentiment polarity of tweets as well as news categories from text. TSC is a supervised algorithm designed to perform short text classification with tolerance near sets (TNS). The proposed TSC algorithm uses pre-trained Sentence-BERT (SBERT) vectors to create tolerance classes. The effectiveness of the TSC algorithm has been demonstrated by testing it on ten well-researched datasets. One of the datasets (Covid-Sentiment) was hand-crafted from tweets with opinions related to COVID. Experiments demonstrate that TSC outperforms five classical ML algorithms on one dataset and is comparable on all other datasets using a weighted F1-score measure.
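
A small sketch of the embedding step behind TSC is shown below: sentences are encoded with a pre-trained SBERT model from the sentence-transformers library and grouped into tolerance classes, i.e. sets of sentences whose pairwise cosine distance stays within a tolerance epsilon. This only illustrates the idea, not the thesis implementation; the model name, epsilon value, and sentences are assumptions.

    # SBERT embeddings and a simple tolerance-class grouping (illustration only)
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_distances

    texts = [
        "lockdown is really getting to me",
        "staying home all day is exhausting",
        "the new stadium looks amazing",
    ]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(texts)

    epsilon = 0.5
    distances = cosine_distances(vectors)

    # Tolerance class of each sentence: all sentences within epsilon of it
    for i, text in enumerate(texts):
        members = [texts[j] for j in range(len(texts)) if distances[i, j] <= epsilon]
        print(f"{text!r} -> {members}")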



2017 ◽  
Author(s):  
Raeuf Roushangar ◽  
George I. Mias

Abstract Machine learning methods are used routinely by scientists in many research areas, typically requiring significant statistical and programming knowledge. Here we present ClassificaIO, an open-source Python graphical user interface for machine learning classification built on the scikit-learn Python library. ClassificaIO provides an interactive way to train, validate, and test data on a range of classification algorithms. The software enables fast comparisons within and across classifiers, and facilitates uploading and exporting of trained models as well as validation and testing data results. ClassificaIO aims to provide not only a research utility but also an educational tool that can enable biomedical and other researchers with minimal machine learning background to apply machine learning algorithms to their research in an interactive, point-and-click way. The ClassificaIO package is available for download and installation through the Python Package Index (PyPI) (http://pypi.python.org/pypi/ClassificaIO) and can be loaded using Python's import statement once the package is installed. The application is distributed under an MIT license, and the source code is publicly available for download (for Mac OS X, Linux and Microsoft Windows) through PyPI and GitHub (http://github.com/gmiaslab/ClassificaIO and https://doi.org/10.5281/zenodo.1320465).
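
A minimal usage sketch following the availability statement is given below; the import path and gui() entry point are assumptions and should be checked against the package README.

    # pip install ClassificaIO                    # install once from PyPI
    from ClassificaIO import ClassificaIO         # assumed import path; see the project README

    # Launch the point-and-click interface (assumed entry point; check the README
    # if the call differs)
    ClassificaIO.gui()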



2020 ◽  
Vol 2020 ◽  
pp. 1-19
Author(s):  
Arshad Ahmad ◽  
Chong Feng ◽  
Muzammil Khan ◽  
Asif Khan ◽  
Ayaz Ullah ◽  
...  

Context. Improvements made over the last couple of decades in requirements engineering (RE) processes and methods have been accompanied by a rapid rise in the use of diverse machine learning (ML) techniques to resolve multifaceted RE issues. One such challenging issue is the effective identification and classification of software requirements on Stack Overflow (SO) for building quality systems. ML-based techniques applied to this issue have produced quite substantial results, much more effective than those produced by conventional natural language processing (NLP) techniques. Nonetheless, a complete, systematic, and detailed understanding of these ML-based techniques is scarce. Objective. To identify and classify the kinds of ML algorithms used for software requirements identification, primarily on SO. Method. This paper reports a systematic literature review (SLR) collecting empirical evidence published up to May 2020. Results. The SLR found 2,484 published papers related to RE and SO. The data extraction process showed that (1) Latent Dirichlet Allocation (LDA) topic modeling is among the most widely used ML algorithms in the selected studies and (2) precision and recall are among the most commonly used measures for evaluating the performance of these ML algorithms. Conclusion. Our SLR revealed that while ML algorithms have phenomenal capabilities for identifying software requirements on SO, they are still confronted with various open problems that limit their practical applications and performance. Our SLR calls for close collaboration between the RE and ML communities to handle the open issues confronted in the development of real-world machine learning-based quality systems.
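
For reference, the two evaluation measures most often reported in the reviewed studies can be computed as follows on toy labels (scikit-learn; the labels are invented, with 1 marking a sentence that contains a software requirement).

    # Toy illustration of the most commonly reported evaluation measures
    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1]   # gold labels
    y_pred = [1, 0, 0, 1, 1, 1]   # classifier output

    print("precision:", precision_score(y_true, y_pred))   # 0.75
    print("recall:   ", recall_score(y_true, y_pred))      # 0.75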



2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Conrad J. Harrison ◽  
Chris J. Sidey-Gibbons

Abstract
Background: Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text data, using freely available software.
Methods: We performed three NLP experiments using publicly available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity.
Results: Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664, 0.776] for the SVM.
Conclusions: In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.
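
A toy sketch of the kind of supervised pipeline compared in the paper is given below: TF-IDF features fed to a regularised logistic regression and a linear SVM, scored by accuracy and AUC. The reviews and ratings are invented, and for brevity the models are scored on their training data rather than on a held-out split as in the paper.

    # TF-IDF + two supervised classifiers for positive/negative review prediction
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, roc_auc_score

    reviews = [
        "worked quickly and no side effects",
        "felt awful, constant headaches",
        "great results after two weeks",
        "made my symptoms worse",
    ]
    ratings = [1, 0, 1, 0]   # 1 = positive review, 0 = negative review

    X = TfidfVectorizer().fit_transform(reviews)

    for name, clf in [("logistic regression", LogisticRegression(C=1.0)),
                      ("SVM", SVC(kernel="linear"))]:
        clf.fit(X, ratings)                   # in practice: fit on a training split
        scores = clf.decision_function(X)     # and evaluate on held-out data
        print(name,
              "accuracy:", accuracy_score(ratings, clf.predict(X)),
              "AUC:", roc_auc_score(ratings, scores))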



2021 ◽  
Vol 11 (8) ◽  
pp. 3296
Author(s):  
Musarrat Hussain ◽  
Jamil Hussain ◽  
Taqdir Ali ◽  
Syed Imran Ali ◽  
Hafiz Syed Muhammad Bilal ◽  
...  

Clinical Practice Guidelines (CPGs) aim to optimize patient care by assisting physicians during the decision-making process. However, guideline adherence is strongly affected by their unstructured format and by the aggregation of background information with disease-specific information. The objective of our study is to extract disease-specific information from CPGs to enhance their adherence ratio. In this research, we propose a semi-automatic mechanism for extracting disease-specific information from CPGs using pattern-matching techniques. We apply supervised and unsupervised machine-learning algorithms to CPGs to extract a list of salient terms that help distinguish recommendation sentences (RS) from non-recommendation sentences (NRS). Simultaneously, a group of experts analyzes the same CPG and extracts initial "heuristic patterns" using a group decision-making method, the nominal group technique (NGT). We provide the list of salient terms to the experts and ask them to refine their extracted patterns; the experts refine the patterns considering the provided salient terms. The extracted heuristic patterns depend on specific terms and suffer from an over-specialization problem due to synonymy and polysemy. Therefore, we generalize the heuristic patterns to part-of-speech (POS) patterns and Unified Medical Language System (UMLS) patterns, which makes the proposed method generalizable to all types of CPGs. We evaluated the initial extracted patterns on asthma, rhinosinusitis, and hypertension guidelines with accuracies of 76.92%, 84.63%, and 89.16%, respectively. The accuracy increased to 78.89%, 85.32%, and 92.07%, respectively, with the refined machine-learning-assisted patterns. Our system assists physicians by locating disease-specific information in CPGs, which enhances physicians' performance and reduces CPG processing time. Additionally, it is beneficial for CPG content annotation.
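
To illustrate the generalisation step, the sketch below turns a lexical heuristic such as "should be prescribed" into a part-of-speech pattern (a modal followed by a verb or auxiliary) using spaCy's rule matcher. This is not the authors' system; the pattern, sentences, and model name are assumptions.

    # Generalising a lexical heuristic into a POS pattern with spaCy's Matcher.
    # Requires: python -m spacy download en_core_web_sm
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # "should be prescribed" generalised to: modal auxiliary + verb/auxiliary,
    # a shape typical of recommendation sentences in guidelines (assumed pattern)
    matcher.add("RECOMMENDATION", [[{"TAG": "MD"}, {"POS": {"IN": ["VERB", "AUX"]}}]])

    sentences = [
        "Inhaled corticosteroids should be prescribed for persistent asthma.",
        "Asthma is a chronic inflammatory disorder of the airways.",
    ]
    for sent in sentences:
        doc = nlp(sent)
        label = "recommendation" if matcher(doc) else "non-recommendation"
        print(label, "->", sent)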



mSphere ◽  
2019 ◽  
Vol 4 (3) ◽  
Author(s):  
Artur Yakimovich

ABSTRACT Artur Yakimovich works in the field of computational virology and applies machine learning algorithms to study host-pathogen interactions. In this mSphere of Influence article, he reflects on two papers “Holographic Deep Learning for Rapid Optical Screening of Anthrax Spores” by Jo et al. (Y. Jo, S. Park, J. Jung, J. Yoon, et al., Sci Adv 3:e1700606, 2017, https://doi.org/10.1126/sciadv.1700606) and “Bacterial Colony Counting with Convolutional Neural Networks in Digital Microbiology Imaging” by Ferrari and colleagues (A. Ferrari, S. Lombardi, and A. Signoroni, Pattern Recognition 61:629–640, 2017, https://doi.org/10.1016/j.patcog.2016.07.016). Here he discusses how these papers made an impact on him by showcasing that artificial intelligence algorithms can be equally applicable to both classical infection biology techniques and cutting-edge label-free imaging of pathogens.



2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of online short text documents grows tremendously on the Internet, organizing short texts well has become increasingly urgent. However, traditional feature selection methods are not suitable for short text. In this paper, we propose a method that incorporates syntactic information for short text, emphasizing features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The experimental results show that, by incorporating syntactic information into short text, we obtain more powerful features than those from traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
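
A rough sketch of the underlying idea, emphasising words that participate in more dependency relations, is shown below using spaCy rather than the paper's Weka/SVM setup; the example sentence is invented.

    # Score each word by the number of dependency relations it takes part in
    # (one link to its head unless it is the root, plus one per child).
    # Requires: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("new phone battery drains very fast after the latest update")

    for token in doc:
        n_links = (0 if token.dep_ == "ROOT" else 1) + len(list(token.children))
        print(f"{token.text:10s} dependency links: {n_links}")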



Author(s):  
Jia Luo ◽  
Dongwen Yu ◽  
Zong Dai

It is not feasible to process the huge amount of structured and semi-structured data with manual methods. This study aims to solve this processing problem with machine learning algorithms. We collected text data on company-related public opinion through web crawlers, used the Latent Dirichlet Allocation (LDA) algorithm to extract keywords from the text, and applied fuzzy clustering to group the keywords into topics. The topic keywords are then used as a seed dictionary for new word discovery. To verify the efficiency of machine learning for new word discovery, algorithms based on association rules, N-grams, PMI, and Word2vec were used for comparative testing. The experimental results show that the machine-learning-based Word2vec algorithm achieves the highest accuracy, recall, and F-measure.
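
A minimal sketch of the Word2vec step is shown below: embeddings are trained on a toy tokenised corpus and the nearest neighbours of the seed-dictionary keywords are listed as candidate new words. The corpus, seed words, and hyperparameters are illustrative assumptions, not the study's crawled data.

    # Train Word2vec embeddings and expand a seed dictionary with nearest neighbours
    from gensim.models import Word2Vec

    sentences = [
        ["company", "stock", "price", "falls", "after", "scandal"],
        ["new", "product", "launch", "boosts", "company", "stock"],
        ["regulator", "investigates", "company", "over", "scandal"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50, seed=0)

    seed_dictionary = ["company", "scandal"]
    for word in seed_dictionary:
        print(word, "->", model.wv.most_similar(word, topn=3))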



2021 ◽  
Vol 118 (40) ◽  
pp. e2026053118
Author(s):  
Miles Cranmer ◽  
Daniel Tamayo ◽  
Hanno Rein ◽  
Peter Battaglia ◽  
Samuel Hadden ◽  
...  

We introduce a Bayesian neural network model that can accurately predict not only if, but also when a compact planetary system with three or more planets will go unstable. Our model, trained directly from short N-body time series of raw orbital elements, is more than two orders of magnitude more accurate at predicting instability times than analytical estimators, while also reducing the bias of existing machine learning algorithms by nearly a factor of three. Despite being trained on compact resonant and near-resonant three-planet configurations, the model demonstrates robust generalization to both nonresonant and higher multiplicity configurations, in the latter case outperforming models fit to that specific set of integrations. The model computes instability estimates up to 10^5 times faster than a numerical integrator, and unlike previous efforts provides confidence intervals on its predictions. Our inference model is publicly available in the SPOCK (https://github.com/dtamayo/spock) package, with training code open sourced (https://github.com/MilesCranmer/bnn_chaos_model).
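
A usage sketch of the released package is given below, assembled from common repository conventions rather than verified against it: a compact three-planet system is set up in REBOUND and passed to SPOCK's deep regressor for an instability-time estimate with uncertainty bounds. The class and method names (DeepRegressor, predict_instability_time) and the orbital parameters are assumptions to check against the SPOCK documentation.

    # pip install rebound spock
    import rebound
    from spock import DeepRegressor   # assumed class name; see the SPOCK README

    # Compact, near-resonant three-planet system (toy orbital elements)
    sim = rebound.Simulation()
    sim.add(m=1.0)                       # solar-mass star
    sim.add(m=1e-5, P=1.00, e=0.03)      # three roughly Neptune-mass planets
    sim.add(m=1e-5, P=1.32, e=0.03)
    sim.add(m=1e-5, P=1.75, e=0.03)

    # Predict the instability time with confidence bounds sampled from the
    # Bayesian neural network (assumed method name and return convention)
    model = DeepRegressor()
    median, lower, upper = model.predict_instability_time(sim, samples=1000)
    print(f"instability time ~ {median:.3g} orbits [{lower:.3g}, {upper:.3g}]")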


