Mining Archive.org’s Twitter Stream Grab for Pharmacovigilance Research Gold

Mapping Intimacies ◽

10.1101/859611 ◽

2019 ◽

Author(s):

Ramya Tekumalla ◽

Javad Rafiei Asl ◽

Juan M. Banda

Keyword(s):

Machine Learning ◽

Adverse Drug Reactions ◽

Future Research ◽

Learning Models ◽

Drug Reactions ◽

Computing Power ◽

Important Resource ◽

Knowing That ◽

Machine Learning Models

AbstractIn the last few years Twitter has become an important resource for the identification of Adverse Drug Reactions (ADRs), monitoring flu trends, and other pharmacovigilance and general research applications. Most researchers spend their time crawling Twitter, buying expensive pre-mined datasets, or tediously and slowly building datasets using the limited Twitter API. However, there are a large number of datasets that are publicly available to researchers which are underutilized or unused. In this work, we demonstrate how we mined over 9.4 billion Tweets from archive.org’s Twitter stream grab using a drug-term dictionary and plenty of computing power. Knowing that not everything that shines is gold, we used pre-existing drug-related datasets to build machine learning models to filter our findings for relevance. In this work we present our methodology and the 3,346,758 identified tweets for public use in future research.

Download Full-text

Predicting adverse drug reaction outcomes with machine learning

International Journal of Community Medicine and Public Health ◽

10.18203/2394-6040.ijcmph20180744 ◽

2018 ◽

Vol 5 (3) ◽

pp. 901 ◽

Cited By ~ 3

Author(s):

Andy W. Chen

Keyword(s):

Machine Learning ◽

Adverse Drug Reactions ◽

Predictive Models ◽

Safety Issue ◽

Adverse Event Reporting System ◽

Patient Characteristics ◽

Learning Models ◽

Drug Reactions ◽

Drug Safety Issue ◽

Machine Learning Models

Background: Adverse drug reactions are a drug safety issue affecting more than two million people in the U.S. annually. The Food and Drug Administration (FDA) maintains a comprehensive database of adverse drug reactions reported known as FAERS (FDA adverse event reporting system), providing a valuable resource for studying factors associated with ADRs. The goal of the project is to build predictive models to predict the outcome given patient characteristics and drug usage. The results can be valuable for health care practitioners by offering new knowledge on adverse drug reactions which can be used to improve decision making related to drug prescriptions.Methods: In this paper I present and discuss results from machine learning models used to predict outcomes of ADRs. Machine learning models are a popular set of models for prediction. They have gained attention recently and have been used in a variety of fields. They can be trained on existing data and retrained when new data become available. The trained models are then used to make predictions.Results: I find that the supervised learning models are work similarly within groups, with accuracy between 65% and 75% for predicting deaths and 70% to 75% for predicting hospitalizations. Across groups the models predict hospitalizations better than deaths.Conclusions: The predictive models I built achieve good accuracy. The results can potentially be improved when more data become available in the future.

Download Full-text

Predicting neurological Adverse Drug Reactions based on biological, chemical and phenotypic properties of drugs using machine learning models

Scientific Reports ◽

10.1038/s41598-017-00908-z ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 17

Author(s):

Salma Jamal ◽

Sukriti Goyal ◽

Asheesh Shanker ◽

Abhinav Grover

Keyword(s):

Machine Learning ◽

Adverse Drug Reactions ◽

Learning Models ◽

Drug Reactions ◽

Phenotypic Properties ◽

Machine Learning Models

Download Full-text

A Universal Screening Tool for Dyslexia by a Web-Game and Machine Learning

Frontiers in Computer Science ◽

10.3389/fcomp.2021.628634 ◽

2022 ◽

Vol 3 ◽

Author(s):

Maria Rauschenberger ◽

Ricardo Baeza-Yates ◽

Luz Rello

Keyword(s):

Machine Learning ◽

Early Intervention ◽

Universal Screening ◽

User Study ◽

Future Research ◽

Learning Models ◽

Early Screening ◽

Web Based ◽

Negative Side Effects ◽

Machine Learning Models

Children with dyslexia have difficulties learning how to read and write. They are often diagnosed after they fail school even if dyslexia is not related to general intelligence. Early screening of dyslexia can prevent the negative side effects of late detection and enables early intervention. In this context, we present an approach for universal screening of dyslexia using machine learning models with data gathered from a web-based language-independent game. We designed the game content taking into consideration the analysis of mistakes of people with dyslexia in different languages and other parameters related to dyslexia like auditory perception as well as visual perception. We did a user study with 313 children (116 with dyslexia) and train predictive machine learning models with the collected data. Our method yields an accuracy of 0.74 for German and 0.69 for Spanish as well as a F1-score of 0.75 for German and 0.75 for Spanish, using Random Forests and Extra Trees, respectively. We also present the game content design, potential new auditory input, and knowledge about the design approach for future research to explore Universal screening of dyslexia. universal screening with language-independent content can be used for the screening of pre-readers who do not have any language skills, facilitating a potential early intervention.

Download Full-text

Risk Assessment in Energy Infrastructure Installations by Horizontal Directional Drilling Using Machine Learning

Energies ◽

10.3390/en14020289 ◽

2021 ◽

Vol 14 (2) ◽

pp. 289

Author(s):

Maria Krechowicz ◽

Adam Krechowicz

Keyword(s):

Machine Learning ◽

Risk Assessment ◽

Assessment Process ◽

Future Research ◽

Directional Drilling ◽

Ann Model ◽

Learning Models ◽

Horizontal Directional Drilling ◽

Risk Assessment Process ◽

Machine Learning Models

Nowadays we can observe a growing demand for installations of new gas pipelines in Europe. A large number of them are installed using trenchless Horizontal Directional Drilling (HDD) technology. The aim of this work was to develop and compare new machine learning models dedicated for risk assessment in HDD projects. The data from 133 HDD projects from eight countries of the world were gathered, profiled, and preprocessed. Three machine learning models, logistic regression, random forests, and Artificial Neural Network (ANN), were developed to predict the overall HDD project outcome (failure free installation or installation likely to fail), and the occurrence of identified unwanted events. The best performance in terms of recall and accuracy was achieved for the developed ANN model, which proved to be efficient, fast and robust in predicting risks in HDD projects. Machine learning applications in the proposed models enabled eliminating the involvement of a group of experts in the risk assessment process and therefore significantly lower the costs associated with the risk assessment process. Future research may be oriented towards developing a comprehensive risk management system, which will enable dynamic risk assessment taking into account various combinations of risk mitigation actions.

Download Full-text

Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-080917-013315 ◽

2018 ◽

Vol 1 (1) ◽

pp. 53-68 ◽

Cited By ~ 40

Author(s):

Juan M. Banda ◽

Martin Seneviratne ◽

Tina Hernandez-Boussard ◽

Nigam H. Shah

Keyword(s):

Machine Learning ◽

Clinical Decision ◽

Future Research ◽

Learning Models ◽

Rule Based ◽

Research Problems ◽

Fundamental Research ◽

Future Research Directions ◽

Effectiveness Studies ◽

Machine Learning Models

With the widespread adoption of electronic health records (EHRs), large repositories of structured and unstructured patient data are becoming available to conduct observational studies. Finding patients with specific conditions or outcomes, known as phenotyping, is one of the most fundamental research problems encountered when using these new EHR data. Phenotyping forms the basis of translational research, comparative effectiveness studies, clinical decision support, and population health analyses using routinely collected EHR data. We review the evolution of electronic phenotyping, from the early rule-based methods to the cutting edge of supervised and unsupervised machine learning models. We aim to cover the most influential papers in commensurate detail, with a focus on both methodology and implementation. Finally, future research directions are explored.

Download Full-text

BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules

10.1101/426692 ◽

2018 ◽

Author(s):

Rudraksh Tuwani ◽

Somin Wadhwa ◽

Ganesh Bagler

Keyword(s):

Machine Learning ◽

Molecular Descriptors ◽

Wide Spectrum ◽

Predictive Performance ◽

Sweet Taste ◽

Future Research ◽

Gustatory System ◽

Learning Models ◽

Integrative Framework ◽

Machine Learning Models

ABSTRACTThe dichotomy of sweet and bitter tastes is a salient evolutionary feature of human gustatory system with an innate attraction to sweet taste and aversion to bitterness. A better understanding of molecular correlates of bitter-sweet taste gradient is crucial for identification of natural as well as synthetic compounds of desirable taste on this axis. While previous studies have advanced our understanding of the molecular basis of bitter-sweet taste and contributed models for their identification, there is ample scope to enhance these models by meticulous compilation of bitter-sweet molecules and utilization of a wide spectrum of molecular descriptors. Towards these goals, based on structured data compilation our study provides an integrative framework with state-of-the-art machine learning models for bitter-sweet taste prediction (BitterSweet). We compare different sets of molecular descriptors for their predictive performance and further identify important features as well as feature blocks. The utility of BitterSweet models is demonstrated by taste prediction on large specialized chemical sets such as FlavorDB, FooDB, SuperSweet, Super Natural II, DSSTox, and DrugBank. To facilitate future research in this direction, we make all datasets and BitterSweet models publicly available, and also present an end-to-end software for bitter-sweet taste prediction based on freely available chemical descriptors.

Download Full-text

Comparing regression modeling strategies for predicting hometime

BMC Medical Research Methodology ◽

10.1186/s12874-021-01331-9 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jessalyn K. Holodinsky ◽

Amy Y. X. Yu ◽

Moira K. Kapral ◽

Peter C. Austin

Keyword(s):

Machine Learning ◽

Administrative Data ◽

Statistical Models ◽

Predictive Accuracy ◽

Extreme Values ◽

Future Research ◽

Learning Models ◽

Learning Methods ◽

Machine Learning Methods ◽

Machine Learning Models

Abstract Background Hometime, the total number of days a person is living in the community (not in a healthcare institution) in a defined period of time after a hospitalization, is a patient-centred outcome metric increasingly used in healthcare research. Hometime exhibits several properties which make its statistical analysis difficult: it has a highly non-normal distribution, excess zeros, and is bounded by both a lower and upper limit. The optimal methodology for the analysis of hometime is currently unknown. Methods Using administrative data we identified adult patients diagnosed with stroke between April 1, 2010 and December 31, 2017 in Ontario, Canada. 90-day hometime and clinically relevant covariates were determined through administrative data linkage. Fifteen different statistical and machine learning models were fit to the data using a derivation sample. The models’ predictive accuracy and bias were assessed using an independent validation sample. Results Seventy-five thousand four hundred seventy-five patients were identified (divided into a derivation set of 49,402 and a test set of 26,073). In general, the machine learning models had lower root mean square error and mean absolute error than the statistical models. However, some statistical models resulted in lower (or equal) bias than the machine learning models. Most of the machine learning models constrained predicted values between the minimum and maximum observable hometime values but this was not the case for the statistical models. The machine learning models also allowed for the display of complex non-linear interactions between covariates and hometime. No model captured the non-normal bucket shaped hometime distribution. Conclusions Overall, no model clearly outperformed the others. However, it was evident that machine learning methods performed better than traditional statistical methods. Among the machine learning methods, generalized boosting machines using the Poisson distribution as well as random forests regression were the best performing. No model was able to capture the bucket shaped hometime distribution and future research on factors which are associated with extreme values of hometime that are not available in administrative data is warranted.

Download Full-text

Learning from low precision samples

The International FLAIRS Conference Proceedings ◽

10.32473/flairs.v34i1.128568 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Ji In Choi ◽

Madeleine Georges ◽

Jung Ah Shin ◽

Olivia Wang ◽

Tiffany Zhu ◽

...

Keyword(s):

Machine Learning ◽

Streaming Data ◽

Future Research ◽

Support Vector ◽

Stochastic Quantization ◽

Learning Models ◽

Performance Loss ◽

Research Areas ◽

Upper Level ◽

Machine Learning Models

With advances in edge applications in industry and healthcare, machine learning models are increasingly trained on the edge. However, storage and memory infrastructure at the edge are often primitive, due to cost and real-estate constraints.A simple, effective method is to learn machine learning models from quantized data stored with low arithmetic precision (1-8 bits).In this work, we introduce two stochastic quantization methods, dithering and stochastic rounding. In dithering, additive noise from a uniform distribution is added to the sample before quantization. In stochastic rounding, each sample is quantized to the upper level with probability p and to a lower level with probability 1-p.The key contributions of the paper are as follows: For 3 standard machine learning models, Support Vector Machines, Decision Trees and Linear (Logistic) Regression, we compare the performance loss for a standard static quantization and stochastic quantization for 55 classification and 30 regression datasets with 1-8 bits quantization. We showcase that for 4- and 8-bit quantization over regression datasets, stochastic quantization demonstrates statistically significant improvement. We investigate the performance loss as a function of dataset attributes viz. number of features, standard deviation, skewness. This helps create a transfer function which will recommend the best quantizer for a given dataset. We propose 2 future research areas, dynamic quantizer update where the model is trained using streaming data and the quantizer is updated after each batch and precision re-allocation under budget constraints where different precision is used for different features.

Download Full-text

Competence region estimation for black-box surrogate models

The International FLAIRS Conference Proceedings ◽

10.32473/flairs.v34i1.128571 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Tapan Shah

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Black Box ◽

Future Research ◽

Stochastic Quantization ◽

Learning Models ◽

Performance Loss ◽

Research Areas ◽

Upper Level ◽

Machine Learning Models

With advances in edge applications for industry andhealthcare, machine learning models are increasinglytrained on the edge. However, storage and memory in-frastructure at the edge are often primitive, due to costand real-estate constraints. A simple, effective methodis to learn machine learning models from quantized datastored with low arithmetic precision (1-8 bits). In thiswork, we introduce two stochastic quantization meth-ods, dithering and stochastic rounding. In dithering, ad-ditive noise from a uniform distribution is added tothe sample before quantization. In stochastic rounding,each sample is quantized to the upper level with prob-ability p and to a lower level with probability 1-p. Thekey contributions of the paper are For 3 standard machine learning models, Support Vec-tor Machines, Decision Trees and Linear (Logistic)Regression, we compare the performance loss for astandard static quantization and stochastic quantiza-tion for 55 classification and 30 regression datasetswith 1-8 bits quantization. We showcase that for 4- and 8-bit quantization overregression datasets, stochastic quantization demon-strates statistically significant improvement. We investigate the performance loss as a function ofdataset attributes viz. number of features, standard de-viation, skewness. This helps create a transfer functionwhich will recommend the best quantizer for a givendataset. We propose 2 future research areas, a) dynamic quan-tizer update where the model is trained using stream-ing data and the quantizer is updated after each batchand b) precision re-allocation under budget constraintswhere different precision is used for different features.

Download Full-text

Improving XGBoost with Imagination Sampling

Communications of the Blyth Institute ◽

10.33014/issn.2640-5652.2.1.holloway.1 ◽

2020 ◽

Vol 2 (1) ◽

pp. 3-6

Author(s):

Eric Holloway

Keyword(s):

Machine Learning ◽

General System ◽

Learning Models ◽

Starting Point ◽

Machine Learning Models

Imagination Sampling is the usage of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling for obtaining multibox models. Here, the possibility of importing such models as the starting point for further automatic enhancement is explored.

Download Full-text