scholarly journals Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study (Preprint)

2019 ◽  
Author(s):  
Shyam Visweswaran ◽  
Jason B Colditz ◽  
Patrick O’Halloran ◽  
Na-Rae Han ◽  
Sanya B Taneja ◽  
...  

BACKGROUND Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets. OBJECTIVE This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments. METHODS We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance. RESULTS LSTM-CNN performed the best with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98) for relevance, all deep learning classifiers including LSTM-CNN performed better than the traditional classifiers with an AUC of 0.99 (95% CI 0.98-0.99) for distinguishing commercial from noncommercial tweets, and BiLSTM performed the best with an AUC of 0.83 (95% CI 0.78-0.89) for provape sentiment. Overall, LSTM-CNN performed the best across all 3 classification tasks. CONCLUSIONS We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.

10.2196/17478 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e17478 ◽  
Author(s):  
Shyam Visweswaran ◽  
Jason B Colditz ◽  
Patrick O’Halloran ◽  
Na-Rae Han ◽  
Sanya B Taneja ◽  
...  

Background Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets. Objective This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments. Methods We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance. Results LSTM-CNN performed the best with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98) for relevance, all deep learning classifiers including LSTM-CNN performed better than the traditional classifiers with an AUC of 0.99 (95% CI 0.98-0.99) for distinguishing commercial from noncommercial tweets, and BiLSTM performed the best with an AUC of 0.83 (95% CI 0.78-0.89) for provape sentiment. Overall, LSTM-CNN performed the best across all 3 classification tasks. Conclusions We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.


2021 ◽  
Vol 50 (8) ◽  
pp. 2479-2497
Author(s):  
Buvana M. ◽  
Muthumayil K.

One of the most symptomatic diseases is COVID-19. Early and precise physiological measurement-based prediction of breathing will minimize the risk of COVID-19 by a reasonable distance from anyone; wearing a mask, cleanliness, medication, balanced diet, and if not well stay safe at home. To evaluate the collected datasets of COVID-19 prediction, five machine learning classifiers were used: Nave Bayes, Support Vector Machine (SVM), Logistic Regression, K-Nearest Neighbour (KNN), and Decision Tree. COVID-19 datasets from the Repository were combined and re-examined to remove incomplete entries, and a total of 2500 cases were utilized in this study. Features of fever, body pain, runny nose, difficulty in breathing, shore throat, and nasal congestion, are considered to be the most important differences between patients who have COVID-19s and those who do not. We exhibit the prediction functionality of five machine learning classifiers. A publicly available data set was used to train and assess the model. With an overall accuracy of 99.88 percent, the ensemble model is performed commendably. When compared to the existing methods and studies, the proposed model is performed better. As a result, the model presented is trustworthy and can be used to screen COVID-19 patients timely, efficiently.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5982 ◽  
Author(s):  
Hwan-ho Cho ◽  
Seung-hak Lee ◽  
Jonghoon Kim ◽  
Hyunjin Park

Background Grading of gliomas is critical information related to prognosis and survival. We aimed to apply a radiomics approach using various machine learning classifiers to determine the glioma grading. Methods We considered 285 (high grade n = 210, low grade n = 75) cases obtained from the Brain Tumor Segmentation 2017 Challenge. Manual annotations of enhancing tumors, non-enhancing tumors, necrosis, and edema were provided by the database. Each case was multi-modal with T1-weighted, T1-contrast enhanced, T2-weighted, and FLAIR images. A five-fold cross validation was adopted to separate the training and test data. A total of 468 radiomics features were calculated for three types of regions of interest. The minimum redundancy maximum relevance algorithm was used to select features useful for classifying glioma grades in the training cohort. The selected features were used to build three classifier models of logistics, support vector machines, and random forest classifiers. The classification performance of the models was measured in the training cohort using accuracy, sensitivity, specificity, and area under the curve (AUC) of the receiver operating characteristic curve. The trained classifier models were applied to the test cohort. Results Five significant features were selected for the machine learning classifiers and the three classifiers showed an average AUC of 0.9400 for training cohorts and 0.9030 (logistic regression 0.9010, support vector machine 0.8866, and random forest 0.9213) for test cohorts. Discussion Glioma grading could be accurately determined using machine learning and feature selection techniques in conjunction with a radiomics approach. The results of our study might contribute to high-throughput computer aided diagnosis system for gliomas.


Author(s):  
Grayson R. Morgan ◽  
Cuizhen Wang ◽  
Zhenlong Li ◽  
Steven R. Schill ◽  
Daniel R. Morgan

Deep learning techniques are increasingly being recognized as effective image classifiers. Aside from their successful performance in past studies, the accuracies have varied in complex environments in comparison with the popularly applied machine learning classifiers. This study seeks to explore the feasibility for using a U-Net deep learning architecture to classify bi-temporal high resolution county scale aerial images to determine the spatial extent and changes of land cover classes that directly or indirectly impact tidal marsh. The image set used in the analysis is a collection of a 1-m resolution collection of National Agriculture Imagery Program (NAIP) tiles from 2009 and 2019 covering Beaufort County, South Carolina. The U-net CNN classification results were compared with two machine learning classifiers, the Random Trees (RT) and the Support Vector Machine (SVM). The results revealed a significant accuracy advantage in using the U-Net classifier (92.4%) as opposed to the SVM (81.6%) and RT (75.7%) classifiers for overall accuracy. From the perspective of a GIS analyst or coastal manager, the U-Net classifier is now an easily accessible nad powerful tool for mapping large areas. Change detection analysis indicated little areal change on marsh extent, though increased land development throughout the county has the potential to negatively impact the health of the marshes. Future work should explore applying the constructed U-Net classifier to coastal environments in large geographic areas, while also implementing other data sources (e.g., LIDAR, multispectral data) to enhance classification accuracy.


2021 ◽  
pp. 016555152110065
Author(s):  
Rahma Alahmary ◽  
Hmood Al-Dossari

Sentiment analysis (SA) aims to extract users’ opinions automatically from their posts and comments. Almost all prior works have used machine learning algorithms. Recently, SA research has shown promising performance in using the deep learning approach. However, deep learning is greedy and requires large datasets to learn, so it takes more time for data annotation. In this research, we proposed a semiautomatic approach using Naïve Bayes (NB) to annotate a new dataset in order to reduce the human effort and time spent on the annotation process. We created a dataset for the purpose of training and testing the classifier by collecting Saudi dialect tweets. The dataset produced from the semiautomatic model was then used to train and test deep learning classifiers to perform Saudi dialect SA. The accuracy achieved by the NB classifier was 83%. The trained semiautomatic model was used to annotate the new dataset before it was fed into the deep learning classifiers. The three deep learning classifiers tested in this research were convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional long short-term memory (Bi-LSTM). Support vector machine (SVM) was used as the baseline for comparison. Overall, the performance of the deep learning classifiers exceeded that of SVM. The results showed that CNN reported the highest performance. On one hand, the performance of Bi-LSTM was higher than that of LSTM and SVM, and, on the other hand, the performance of LSTM was higher than that of SVM. The proposed semiautomatic annotation approach is usable and promising to increase speed and save time and effort in the annotation process.


2019 ◽  
Vol 16 (2) ◽  
pp. 5-16
Author(s):  
Amit Singh ◽  
Ivan Li ◽  
Otto Hannuksela ◽  
Tjonnie Li ◽  
Kyungmin Kim

Gravitational waves are theorized to be gravitationally lensed when they propagate near massive objects. Such lensing effects cause potentially detectable repeated gravitational wave patterns in ground- and space-based gravitational wave detectors. These effects are difficult to discriminate when the lens is small and the repeated patterns superpose. Traditionally, matched filtering techniques are used to identify gravitational-wave signals, but we instead aim to utilize machine learning techniques to achieve this. In this work, we implement supervised machine learning classifiers (support vector machine, random forest, multi-layer perceptron) to discriminate such lensing patterns in gravitational wave data. We train classifiers with spectrograms of both lensed and unlensed waves using both point-mass and singular isothermal sphere lens models. As the result, classifiers return F1 scores ranging from 0:852 to 0:996, with precisions from 0:917 to 0:992 and recalls ranging from 0:796 to 1:000 depending on the type of classifier and lensing model used. This supports the idea that machine learning classifiers are able to correctly determine lensed gravitational wave signals. This also suggests that in the future, machine learning classifiers may be used as a possible alternative to identify lensed gravitational wave events and to allow us to study gravitational wave sources and massive astronomical objects through further analysis. KEYWORDS: Gravitational Waves; Gravitational Lensing; Geometrical Optics; Machine Learning; Classification; Support Vector Machine; Random Tree Forest; Multi-layer Perceptron


2022 ◽  
Vol 12 (2) ◽  
pp. 828
Author(s):  
Tebogo Bokaba ◽  
Wesley Doorsamy ◽  
Babu Sena Paul

Road traffic accidents (RTAs) are a major cause of injuries and fatalities worldwide. In recent years, there has been a growing global interest in analysing RTAs, specifically concerned with analysing and modelling accident data to better understand and assess the causes and effects of accidents. This study analysed the performance of widely used machine learning classifiers using a real-life RTA dataset from Gauteng, South Africa. The study aimed to assess prediction model designs for RTAs to assist transport authorities and policymakers. It considered classifiers such as naïve Bayes, logistic regression, k-nearest neighbour, AdaBoost, support vector machine, random forest, and five missing data methods. These classifiers were evaluated using five evaluation metrics: accuracy, root-mean-square error, precision, recall, and receiver operating characteristic curves. Furthermore, the assessment involved parameter adjustment and incorporated dimensionality reduction techniques. The empirical results and analyses show that the RF classifier, combined with multiple imputations by chained equations, yielded the best performance when compared with the other combinations.


2019 ◽  
Vol 8 (4) ◽  
pp. 2187-2191

Music in an essential part of life and the emotion carried by it is key to its perception and usage. Music Emotion Recognition (MER) is the task of identifying the emotion in musical tracks and classifying them accordingly. The objective of this research paper is to check the effectiveness of popular machine learning classifiers like XGboost, Random Forest, Decision Trees, Support Vector Machine (SVM), K-Nearest-Neighbour (KNN) and Gaussian Naive Bayes on the task of MER. Using the MIREX-like dataset [17] to test these classifiers, the effects of oversampling algorithms like Synthetic Minority Oversampling Technique (SMOTE) [22] and Random Oversampling (ROS) were also verified. In all, the Gaussian Naive Bayes classifier gave the maximum accuracy of 40.33%. The other classifiers gave accuracies in between 20.44% and 38.67%. Thus, a limit on the classification accuracy has been reached using these classifiers and also using traditional musical or statistical metrics derived from the music as input features. In view of this, deep learning-based approaches using Convolutional Neural Networks (CNNs) [13] and spectrograms of the music clips for MER is a promising alternative.


2020 ◽  
Author(s):  
John T. Halloran ◽  
Gregor Urban ◽  
David Rocke ◽  
Pierre Baldi

AbstractSemi-supervised machine learning post-processors critically improve peptide identification of shot-gun proteomics data. Such post-processors accept the peptide-spectrum matches (PSMs) and feature vectors resulting from a database search, train a machine learning classifier, and recalibrate PSMs using the trained parameters, often yielding significantly more identified peptides across q-value thresholds. However, current state-of-the-art post-processors rely on shallow machine learning methods, such as support vector machines. In contrast, the powerful training capabilities of deep learning models have displayed superior performance to shallow models in an ever-growing number of other fields. In this work, we show that deep models significantly improve the recalibration of PSMs compared to the most accurate and widely-used post-processors, such as Percolator and PeptideProphet. Furthermore, we show that deep learning is able to adaptively analyze complex datasets and features for more accurate universal post-processing, leading to both improved Prosit analysis and markedly better recalibration of recently developed database-search functions.


Sign in / Sign up

Export Citation Format

Share Document