Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature with Unsupervised Word Embeddings and Machine Learning (Preprint)

Mapping Intimacies ◽

10.2196/preprints.34067 ◽

2021 ◽

Author(s):

Ridam Pal ◽

Harshita Chopra ◽

Raghav Awasthi ◽

Harsh Bandhey ◽

Aditya Nagori ◽

...

Keyword(s):

Machine Learning ◽

Learning Community ◽

Prediction Models ◽

Predictive Modelling ◽

Neurological Complications ◽

Language Models ◽

Supervised Machine Learning ◽

Monthly Interval ◽

Word Embeddings ◽

Topological Features

BACKGROUND Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. The collection of knowledge in publications needs to be distilled into evidence by leveraging natural language models and machine learning. OBJECTIVE We aim to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of literature. Further, imminent themes can be predicted by applying machine learning to the evolving associations between words. METHODS Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles published on the WHO database, collected at monthly intervals starting from February 2020. Word embeddings trained on each month's literature were used to construct networks of entities with cosine similarities as edge weights. Topological features of the subsequent month’s network were forecasted based on prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes that evolved over the months. RESULTS We found that thromboembolic complications were detected as an emerging theme as early as August 2020. A shift towards symptoms of Long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. A prospective validation of the link prediction models achieved an AUROC score of 0.87. Predictive modelling revealed predisposing conditions, symptoms, cross-infection and neurological complications as dominant research themes in COVID-19 publications based on patterns observed in previous months. CONCLUSIONS Machine learning-based prediction of emerging links can contribute towards steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.
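The network construction step in METHODS (entities as nodes, cosine similarities as edge weights) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the toy two-dimensional vectors, and the `threshold` cut-off are all assumptions introduced here; the abstract does not say whether edges were thresholded.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_entity_network(embeddings, threshold=0.5):
    """Return {(entity_a, entity_b): weight} for pairs at or above threshold.

    `threshold` is a hypothetical parameter used only to keep the toy
    graph sparse.
    """
    edges = {}
    for a, b in combinations(sorted(embeddings), 2):
        weight = cosine_similarity(embeddings[a], embeddings[b])
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical toy embeddings standing in for one month's trained vectors.
toy_embeddings = {
    "fever": [1.0, 0.0],
    "cough": [1.0, 1.0],
    "thrombosis": [0.0, 1.0],
}
network = build_entity_network(toy_embeddings, threshold=0.5)
```

Building one such network per month gives the sequence of graphs on which topological features and new links could then be forecast.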

Cross-lingual transfer of sentiment classifiers

Slovenščina 2 0 empirical applied and interdisciplinary research ◽

10.4312/slo2.0.2021.1.1-25 ◽

2021 ◽

Vol 9 (1) ◽

pp. 1-25

Author(s):

Marko Robnik-Šikonja ◽

Kristjan Reba ◽

Igor Mozetič

Keyword(s):

Machine Learning ◽

Vector Space ◽

Prediction Models ◽

Language Models ◽

Target Language ◽

Word Embeddings ◽

Language Data ◽

Cross Lingual ◽

Multiple Languages ◽

Transfer Mechanisms

Word embeddings represent words in a numeric space so that semantic relations between words are represented as distances and directions in the vector space. Cross-lingual word embeddings transform vector spaces of different languages so that similar words are aligned. This is done by mapping one language’s vector space to the vector space of another language or by constructing a joint vector space for multiple languages. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We use cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between 13 languages. We focus on two transfer mechanisms that have recently shown superior transfer performance. The first mechanism uses trained models whose input is the joint numerical space for many languages, as implemented in the LASER library. The second mechanism uses large pretrained multilingual BERT language models. Our experiments show that the transfer of models between similar languages is sensible, even with no target language data. The performance of cross-lingual models obtained with the multilingual BERT and LASER library is comparable, and the differences are language-dependent. The transfer with CroSloEngual BERT, pretrained on only three languages, is superior on these and some closely related languages.

Predictive Modelling of Employee Turnover in Indian IT Industry Using Machine Learning Techniques

Vision The Journal of Business Perspective ◽

10.1177/0972262918821221 ◽

2019 ◽

Vol 23 (1) ◽

pp. 12-21 ◽

Cited By ~ 2

Author(s):

Shikha N. Khera ◽

Divya

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Confusion Matrix ◽

Predictive Modelling ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Support Vector ◽

It Industry ◽

Knowledge Based ◽

Employee Attrition

The information technology (IT) industry in India has been facing a systemic issue of high attrition in the past few years, resulting in monetary and knowledge-based losses to the companies. The aim of this research is to develop a model to predict employee attrition and provide organizations with opportunities to address any issue and improve retention. A predictive model was developed based on a supervised machine learning algorithm, the support vector machine (SVM). Archival employee data (consisting of 22 input features) were collected from the Human Resource databases of three IT companies in India, including the employees' employment status (response variable) at the time of collection. Results from the confusion matrix for the SVM model showed that the model has an accuracy of 85 per cent. The results also show that the model performs better at predicting who will leave the firm than at predicting who will not leave the company.
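The two findings reported above (overall accuracy, and better performance on leavers than on stayers) both come from the confusion matrix. A minimal sketch of that calculation, assuming a binary label where 1 means the employee left; the labels and predictions below are made-up toy values, not the study's data:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Overall accuracy plus per-class recall (leavers vs. stayers)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall_leavers": tp / (tp + fn),
        "recall_stayers": tn / (tn + fp),
    }


# Hypothetical toy data: 1 = left the firm, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
summary = summarize(*confusion_counts(y_true, y_pred))
```

Comparing `recall_leavers` with `recall_stayers` is one way to express the asymmetry the abstract describes; the paper may have used a different per-class measure.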

Machine Learning Frameworks in Cancer Detection

E3S Web of Conferences ◽

10.1051/e3sconf/202129701073 ◽

2021 ◽

Vol 297 ◽

pp. 01073

Author(s):

Sabyasachi Pramanik ◽

K. Martin Sagayam ◽

Om Prakash Jena

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Cancer Development ◽

Support Vector ◽

Learning Approaches ◽

Learning Techniques ◽

Fact Finding ◽

Risk Of Cancer

Cancer has been described as a diverse illness with several distinct subtypes that may occur simultaneously. As a result, early detection and forecasting of cancer types have become essential in cancer research, since they may help to improve the clinical treatment of cancer patients. The importance of classifying cancer patients into higher- or lower-risk categories has prompted numerous research teams from the bioscience and genomics fields to investigate the use of machine learning (ML) algorithms in cancer diagnosis and treatment. These methods have therefore been used with the goal of modelling the development and treatment of malignant diseases in humans. Furthermore, the capacity of machine learning techniques to identify important characteristics in complicated datasets demonstrates the significance of these technologies, which include Bayesian networks and artificial neural networks, along with a number of other approaches. Decision trees and support vector machines, which have already been used extensively in cancer research to build predictive models, also support accurate decision making. The application of machine learning techniques may undoubtedly enhance our knowledge of cancer development; nevertheless, a sufficient degree of validation is required before these approaches can be considered for use in daily clinical practice. This paper presents an overview of current machine learning approaches used to model cancer development. All of the supervised machine learning approaches described here, along with a variety of input characteristics and data samples, are used to build the prediction models. In light of the increasing trend towards the use of machine learning methods in biomedical research, we present the most recent papers that have used these approaches to predict cancer risk or patient outcomes in order to better understand cancer.

Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine ◽

10.1186/s12967-020-02550-2 ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Kerry E. Poppenberg ◽

Vincent M. Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.
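The area under the ROC curve used above to assess the models can be computed directly from classifier scores. The sketch below uses the pairwise (Mann-Whitney) formulation: AUC is the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as one half. The labels and scores are invented toy values, and `roc_auc` is a name introduced here for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case (e.g. aneurysm present), 0 otherwise.
    scores: classifier outputs, higher meaning more likely positive.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out test scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.4]
auc = roc_auc(toy_labels, toy_scores)
```

The quadratic pairwise loop is fine for a test cohort of a few dozen samples; a rank-based version would be preferable for large datasets.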

Comparing Supervised Machine Learning Strategies and Linguistic Features to Search for Very Negative Opinions

Information ◽

10.3390/info10010016 ◽

2019 ◽

Vol 10 (1) ◽

pp. 16 ◽

Cited By ~ 3

Author(s):

Sattam Almatarneh ◽

Pablo Gamallo

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Empirical Study ◽

Learning Strategies ◽

Supervised Machine Learning ◽

Support Vector ◽

Word Embeddings ◽

Linguistic Features ◽

Machine Learning Classifiers ◽

Supervised Machine Learning Classifiers

In this paper, we examine the performance of several classifiers in the process of searching for very negative opinions. More precisely, we do an empirical study that analyzes the influence of three types of linguistic features (n-grams, word embeddings, and polarity lexicons) and their combinations when they are used to feed different supervised machine learning classifiers: Naive Bayes (NB), Decision Tree (DT), and Support Vector Machine (SVM). The experiments we have carried out show that SVM clearly outperforms NB and DT in all datasets by taking into account all features individually as well as their combinations.
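Two of the three feature types named above, n-grams and polarity lexicons, are simple enough to sketch without any library. The tokens and the tiny lexicons below are hypothetical placeholders; the paper's actual lexicons and preprocessing are not described in this abstract.

```python
def word_ngrams(tokens, n):
    """Return the list of word n-grams (as space-joined strings)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexicon_counts(tokens, positive_words, negative_words):
    """Count how many tokens appear in each polarity lexicon."""
    return {
        "pos": sum(1 for t in tokens if t in positive_words),
        "neg": sum(1 for t in tokens if t in negative_words),
    }


# Hypothetical review text and toy lexicons.
tokens = "the worst hotel ever".split()
bigrams = word_ngrams(tokens, 2)
polarity = lexicon_counts(tokens, {"great", "lovely"}, {"worst", "awful"})
```

Features like these would then be turned into a numeric vector per document and passed to a classifier such as NB, DT, or SVM.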

Supervised Machine Learning for Predicting SMME Sales: An Evaluation of Three Algorithms

The African Journal of Information and Communication ◽

10.23962/10539/31371 ◽

2021 ◽

pp. 1-21

Author(s):

Helper Zhou ◽

Victor Gumbo

Keyword(s):

Machine Learning ◽

Predictive Analytics ◽

Predictive Modelling ◽

Sales Performance ◽

Ordinary Least Squares ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Selection Operator

The emergence of machine learning algorithms presents the opportunity for a variety of stakeholders to perform advanced predictive analytics and to make informed decisions. However, to date there have been few studies in developing countries that evaluate the performance of such algorithms—with the result that pertinent stakeholders lack an informed basis for selecting appropriate techniques for modelling tasks. This study aims to address this gap by evaluating the performance of three machine learning techniques: ordinary least squares (OLS), least absolute shrinkage and selection operator (LASSO), and artificial neural networks (ANNs). These techniques are evaluated in respect of their ability to perform predictive modelling of the sales performance of small, medium and micro enterprises (SMMEs) engaged in manufacturing. The evaluation finds that the ANNs algorithm’s performance is far superior to that of the other two techniques, OLS and LASSO, in predicting the SMMEs’ sales performance.
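The difference between OLS and LASSO can be seen in the simplest possible setting: one feature and no intercept. The sketch below assumes the LASSO objective (1/2n) Σ (y − b·x)² + λ|b|, for which the solution is a soft-thresholded version of the OLS estimate. The data and the value of `lam` are toy assumptions and have nothing to do with the SMME dataset.

```python
def ols_slope(x, y):
    """Least-squares slope for y ≈ b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def soft_threshold(value, lam):
    """Shrink value towards zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_slope(x, y, lam):
    """Minimizer of (1/2n) * sum((y - b*x)^2) + lam * |b| for one feature."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n   # (1/n) * sum(x*y)
    z = sum(a * a for a in x) / n                # (1/n) * sum(x*x)
    return soft_threshold(rho, lam) / z


# Toy data generated by y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
b_ols = ols_slope(x, y)
b_lasso = lasso_slope(x, y, lam=1.0)
```

With a large enough `lam` the LASSO slope becomes exactly zero, which is the mechanism behind its use as a selection operator.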

Domain Heuristic Fusion of Multi-Word Embeddings for Nutrient Value Prediction

Mathematics ◽

10.3390/math9161941 ◽

2021 ◽

Vol 9 (16) ◽

pp. 1941

Author(s):

Gordana Ispirova ◽

Tome Eftimov ◽

Barbara Koroušić Seljak

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Nutrient Content ◽

Relevant Information ◽

Word Embeddings ◽

Short Text ◽

Domain Specific ◽

Nutrient Value ◽

Protein Prediction ◽

Vector Representations

Being both a poison and a cure for many lifestyle and non-communicable diseases, food is becoming a prime focus of precision medicine. The monitoring of a few groups of nutrients is crucial for some patients, and methods for easing their calculation are emerging. Our proposed machine learning pipeline deals with nutrient prediction based on learned vector representations of short texts (recipe names). In this study, we explored how the prediction results change when, instead of using the vector representations of the recipe description, we use the embeddings of the list of ingredients. The nutrient content of a food depends on its ingredients; therefore, the text of the ingredients contains more relevant information. We define a domain-specific heuristic for merging the embeddings of the ingredients, which incorporates the quantity of each ingredient, in order to use them as features in machine learning models for nutrient prediction. The results from the experiments indicate that the prediction results improve when using the domain-specific heuristic. The prediction models for protein prediction were highly effective, with accuracies up to 97.98%. Implementing a domain-specific heuristic for combining multi-word embeddings yields better results than using conventional merging heuristics, with up to 60% more accuracy in some cases.
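One plausible reading of a merging heuristic that takes ingredient quantities into account is a quantity-weighted average of the ingredient embeddings. The abstract does not give the exact formula, so the sketch below is an assumption for illustration only; the function name and the toy vectors and gram amounts are hypothetical.

```python
def merge_ingredient_embeddings(ingredients):
    """Quantity-weighted average of ingredient vectors.

    ingredients: list of (vector, quantity) pairs, with all quantities in
    the same unit (assumed here to be grams).
    """
    total = sum(quantity for _, quantity in ingredients)
    dim = len(ingredients[0][0])
    merged = [0.0] * dim
    for vector, quantity in ingredients:
        for i, component in enumerate(vector):
            merged[i] += component * quantity / total
    return merged


# Hypothetical recipe: 300 g of one ingredient, 100 g of another.
recipe = [
    ([1.0, 0.0], 300.0),
    ([0.0, 1.0], 100.0),
]
recipe_vector = merge_ingredient_embeddings(recipe)
```

An unweighted mean of the same vectors would give equal influence to both ingredients, which is the kind of conventional merging the abstract contrasts against.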

Bankruptcy Prediction by Supervised Machine Learning Techniques

Surveillance Technologies and Early Warning Systems ◽

10.4018/978-1-61692-865-0.ch007 ◽

2011 ◽

pp. 128-143 ◽

Cited By ~ 1

Author(s):

Chih-Fong Tsai ◽

Yu-Hsin Lu ◽

Yu-Feng Hsu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Bankruptcy Prediction ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Business Failure ◽

Classifier Ensembles ◽

Stacked Generalization ◽

Learning Techniques ◽

Type Ii Errors

It is very important for financial institutions to be capable of accurately predicting business failure. In the literature, a number of bankruptcy prediction models have been developed based on statistical and machine learning techniques. In particular, many machine learning techniques, such as neural networks and decision trees, have shown better prediction performance than statistical ones. However, advanced machine learning techniques, such as classifier ensembles and stacked generalization, have not been fully examined and compared in terms of their bankruptcy prediction performance. The aim of this chapter is to compare two different machine learning techniques, one statistical approach, two types of classifier ensembles, and three stacked generalization classifiers over three related datasets. The experimental results show that classifier ensembles using weighted voting perform best in terms of prediction accuracy. On the other hand, for Type II errors, stacked generalization and single classifiers on average perform better than classifier ensembles.
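Weighted voting, the ensemble scheme reported above as most accurate, can be sketched in a few lines: each base classifier casts a vote for its predicted label, scaled by a weight. How the chapter derives its weights is not stated in the abstract, so the weights below are hypothetical (they could, for example, be validation accuracies).

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per base classifier.
    weights: one non-negative weight per base classifier.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical outputs of three base classifiers for one firm.
predictions = ["bankrupt", "healthy", "healthy"]
weights = [0.9, 0.3, 0.4]
decision = weighted_vote(predictions, weights)
```

Note that the single high-weight classifier outvotes the two low-weight ones here, which is exactly how weighted voting differs from a simple majority vote.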

Supervised machine learning methods in psychology: A practical introduction with annotated R code

10.31234/osf.io/s72vu ◽

2019 ◽

Author(s):

Hannes Rosenbusch ◽

Felix Soldner ◽

Anthony M Evans ◽

Marcel Zeelenberg

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Psychological Research ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Comprehensive Overview ◽

K Nearest Neighbors ◽

Machine Learning Methods ◽

Out Of Sample

Machine learning methods for pattern detection and prediction are increasingly prevalent in psychological research. We provide a comprehensive overview of machine learning, its applications, and how to implement models for research. We review fundamental concepts of machine learning, such as prediction accuracy and out-of-sample evaluation, and summarize four standard prediction algorithms: linear regressions, ridge regressions, decision trees, and random forests (plus k-nearest neighbors, Naïve Bayes classifiers, and support vector machines in the supplementary material). This selection provides a set of powerful models that are implemented regularly in machine learning projects. We demonstrate each method with examples and annotated R code, and discuss best practices for determining sample sizes; comparing model performances; tuning prediction models; preregistering prediction models; and reporting results. Finally, we discuss the value of machine learning methods in maintaining psychology’s status as a predictive science.
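Out-of-sample evaluation, one of the fundamental concepts listed above, is often implemented with k-fold cross-validation: every observation is held out exactly once while the model is fitted on the rest. The paper itself provides annotated R code; the Python sketch below is a separate, minimal illustration of the index bookkeeping only, with `k_fold_indices` a name introduced here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Samples are assigned to folds in round-robin order; shuffle the data
    beforehand if its order is not random.
    """
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            index
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for index in fold
        ]
        yield train, test


# Six observations split into three folds.
splits = list(k_fold_indices(6, 3))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the k scores averaged to estimate out-of-sample performance.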

Comparing Supervised Machine Learning Strategies and Linguistic Features to Search for Very Negative Opinions

10.20944/preprints201811.0436.v1 ◽

2018 ◽

Author(s):

Sattam Almatarneh ◽

Pablo Gamallo

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Empirical Study ◽

Learning Strategies ◽

Supervised Machine Learning ◽

Support Vector ◽

Word Embeddings ◽

Linguistic Features ◽

Machine Learning Classifiers ◽

Supervised Machine Learning Classifiers

In this paper, we examine the performance of several classifiers in the process of searching for very negative opinions. More precisely, we do an empirical study that analyzes the influence of three types of linguistic features (n-grams, word embeddings, and polarity lexicons) and their combinations when they are used to feed different supervised machine learning classifiers: Support Vector Machine (SVM), Naive Bayes (NB), and Decision Tree (DT).