Predicting depression using deep learning and ensemble algorithms on raw twitter data

Author(s):  
Nisha P. Shetty ◽  
Balachandra Muniyal ◽  
Arshia Anand ◽  
Sushant Kumar ◽  
Sushant Prabhu

Social networking and microblogging sites such as Twitter are widespread among all generations; people use them to connect and share their feelings, emotions, and pursuits. Depression, one of the most common mental disorders, is an acute state of sadness in which a person loses interest in all activities. If not treated promptly, it can have dire consequences, including death. In this era of the virtual world, people are more comfortable expressing their emotions on such sites, as they have become part and parcel of everyday life. The research presented here therefore employs machine learning classifiers on a Twitter data set to detect whether a person’s tweet indicates any sign of depression.

10.2196/17478 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e17478 ◽  
Author(s):  
Shyam Visweswaran ◽  
Jason B Colditz ◽  
Patrick O’Halloran ◽  
Na-Rae Han ◽  
Sanya B Taneja ◽  
...  

Background Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets. Objective This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments. Methods We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance. 
Results LSTM-CNN performed the best with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98) for relevance; all deep learning classifiers, including LSTM-CNN, performed better than the traditional classifiers with an AUC of 0.99 (95% CI 0.98-0.99) for distinguishing commercial from noncommercial tweets; and BiLSTM performed the best with an AUC of 0.83 (95% CI 0.78-0.89) for provape sentiment. Overall, LSTM-CNN performed the best across all 3 classification tasks. Conclusions We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.
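The AUC values reported above can be read as the probability that a randomly chosen positive tweet receives a higher classifier score than a randomly chosen negative one. The following is a minimal sketch of that rank-based definition, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are assumptions made for illustration, and the abstract does not state how the confidence intervals were obtained, so none is computed here.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = relevant tweet, 0 = not relevant.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a sketch; a sort-based computation would be the usual choice for large data sets.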


2020 ◽  
Vol 110 (S3) ◽  
pp. S331-S339
Author(s):  
Amelia Jamison ◽  
David A. Broniatowski ◽  
Michael C. Smith ◽  
Kajal S. Parikh ◽  
Adeena Malik ◽  
...  

Objectives. To adapt and extend an existing typology of vaccine misinformation to classify the major topics of discussion across the total vaccine discourse on Twitter. Methods. Using 1.8 million vaccine-relevant tweets compiled from 2014 to 2017, we adapted an existing typology to Twitter data, first in a manual content analysis and then using latent Dirichlet allocation (LDA) topic modeling to extract 100 topics from the data set. Results. Manual annotation identified 22% of the data set as antivaccine, of which safety concerns and conspiracies were the most common themes. Seventeen percent of content was identified as provaccine, with roughly equal proportions of vaccine promotion, criticizing antivaccine beliefs, and vaccine safety and effectiveness. Of the 100 LDA topics, 48 contained provaccine sentiment and 28 contained antivaccine sentiment, with 9 containing both. Conclusions. Our updated typology successfully combines manual annotation with machine-learning methods to estimate the distribution of vaccine arguments, with greater detail on the most distinctive topics of discussion. With this information, communication efforts can be developed to better promote vaccines and avoid amplifying antivaccine rhetoric on Twitter.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
O. Obulesu ◽  
Suresh Kallam ◽  
Gaurav Dhiman ◽  
Rizwan Patan ◽  
Ramana Kadiyala ◽  
...  

Cancer is a complex worldwide health issue with an increasing death rate in recent years. With the rapid growth of high-throughput technology and the machine learning methods that have emerged in recent years, progress has been made in cancer diagnosis based on feature subsets, supporting efficient and precise diagnosis. Hence, advanced machine learning techniques that can differentiate lung cancer patients from healthy persons are of great interest. This paper proposes a novel Wilcoxon Signed-Rank Gain Preprocessing combined with Generative Deep Learning, called the Wilcoxon Signed Generative Deep Learning (WS-GDL) method, for lung cancer diagnosis. First, test significance analysis and information gain eliminate redundant and irrelevant attributes and extract many informative and significant attributes. Then, using a generator function, the Generative Deep Learning method is used to learn the deep features. Finally, a minimax game (i.e., minimizing error with maximum accuracy) is proposed to diagnose the disease. Numerical experiments on the Thoracic Surgery Data Set are used to test the WS-GDL method's diagnosis performance. The WS-GDL approach may create relevant and significant attributes and adaptively diagnose the disease by selecting optimal learning model parameters. Quantitative experimental results show that the WS-GDL method achieves better diagnosis performance and higher computing efficiency in computational time, computational complexity, and false-positive rate compared to state-of-the-art approaches.
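The information-gain step mentioned in this abstract can be illustrated with the standard entropy-based definition, IG(Y; X) = H(Y) - H(Y | X). The sketch below is not the WS-GDL implementation, whose details the abstract does not give; the function names and the toy attributes are assumptions chosen only to show how an informative attribute scores high and an irrelevant one scores zero.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature, target):
    """H(target) minus the entropy of target within each feature value,
    weighted by how often that feature value occurs."""
    total = len(target)
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - conditional


# Toy data (hypothetical): 1 = diseased, 0 = healthy.
target = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]  # perfectly separates the classes
irrelevant = ["a", "b", "a", "b"]   # carries no information about the class

print(information_gain(informative, target))
print(information_gain(irrelevant, target))
```

Under a filter of this kind, attributes whose gain falls below a chosen threshold would be dropped before model training; the threshold itself is a design choice not specified in the abstract.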


Author(s):  
Prof. Manisha Sachin Dabade, Et. al.

In today’s world, social media is widely used and easily accessible. Social media sites such as Twitter, Facebook, and Tumblr are a primary and valuable source of information. Twitter is a microblogging platform that provides an enormous amount of data. Such data can be used for sentiment analysis applications such as reviews, predictions, elections, and marketing. It is one of the most popular sites, where people write tweets, retweet, and interact daily. Monitoring and analyzing these tweets gives valuable feedback to users. Because of the large size of this data, sentiment analysis is used to analyze it without going through millions of tweets manually. Users write their reviews of different products, topics, or events on Twitter in the form of tweets and retweets. People also use emojis (happy, sad, neutral) to express their emotions, so these sites contain large volumes of unprocessed (raw) data. The main goal of this research is to identify suitable algorithms using machine learning classifiers. The study intends to categorize fine-grained sentiments in tweets about vaccination (89,974 tweets) through machine learning and deep learning approaches. The study considers both labeled and unlabeled data. It also detects emojis in tweets using libraries such as TextBlob, VADER, fastText, Flair, Gensim, spaCy, and NLTK.


Machine learning combines mathematical models, artificial intelligence approaches, and previously recorded data. It uses different learning algorithms for different types of data and is classified into three types. An advantage of this kind of learning is that it can use an artificial neural network, which adjusts its weights based on the error rate to improve in subsequent epochs. However, machine learning works well only when the features are defined accurately. Deciding which features to select requires good domain knowledge, which makes machine learning dependent on the developer. A lack of domain knowledge degrades performance. This dependency inspired the development of deep learning. Deep learning can detect features through self-training models and can give better results than conventional artificial intelligence or machine learning approaches. It uses components such as ReLU activation functions, gradient descent, and optimizers, which the authors regard as the best approach available so far. To apply such optimizers efficiently, one should understand the mathematical computations and convolutions running behind the layers. It also uses different pooling layers to extract features. These modern approaches, however, need a high level of computation, which requires CPUs and GPUs. If such hardware is not available, one can use Google Colaboratory. The deep learning approach is shown to improve skin cancer detection, as demonstrated in this paper. The paper also aims to give the reader background knowledge of the practices mentioned above.
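The idea of adjusting weights from the error over successive epochs, together with the ReLU activation and gradient descent named in this abstract, can be shown with a single neuron. This is a minimal sketch under stated assumptions, not the paper's skin cancer model: the toy data (targets equal to twice the input), the starting weights, the learning rate, and the names `relu`, `relu_grad`, and `train` are all hypothetical.

```python
def relu(z):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return z if z > 0.0 else 0.0


def relu_grad(z):
    """Derivative of ReLU with respect to its input."""
    return 1.0 if z > 0.0 else 0.0


def train(xs, ys, w=0.5, b=0.0, lr=0.05, epochs=2000):
    """Fit y = relu(w * x + b) by full-batch gradient descent on the
    mean squared error, updating w and b once per epoch."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            z = w * x + b
            err = relu(z) - y
            # Chain rule: d(err^2)/dw = 2 * err * relu'(z) * x
            grad_w += 2.0 * err * relu_grad(z) * x / n
            grad_b += 2.0 * err * relu_grad(z) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (hypothetical): the target is exactly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, b = train(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Real networks stack many such units and typically replace the fixed learning rate with an adaptive optimizer, but the per-epoch update follows the same pattern.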


SOIL ◽  
2020 ◽  
Vol 6 (2) ◽  
pp. 565-578
Author(s):  
Wartini Ng ◽  
Budiman Minasny ◽  
Wanderson de Sousa Mendes ◽  
José Alexandre Melo Demattê

Abstract. The number of samples used in the calibration data set affects the quality of the generated predictive models using visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy for soil attributes. Recently, the convolutional neural network (CNN) has been regarded as a highly accurate model for predicting soil properties on a large database. However, it has not yet been ascertained how large the sample size should be for CNN model to be effective. This paper investigates the effect of the training sample size on the accuracy of deep learning and machine learning models. It aims at providing an estimate of how many calibration samples are needed to improve the model performance of soil properties predictions with CNN as compared to conventional machine learning models. In addition, this paper also looks at a way to interpret the CNN models, which are commonly labelled as a black box. It is hypothesised that the performance of machine learning models will increase with an increasing number of training samples, but it will plateau when it reaches a certain number, while the performance of CNN will keep improving. The performances of two machine learning models (partial least squares regression – PLSR; Cubist) are compared against the CNN model. A VIS–NIR–SWIR spectra library from Brazil, containing 4251 unique sites with averages of two to three samples per depth (a total of 12 044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. A subset of the calibration data set was then created to represent a smaller calibration data set ranging from 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, which is equivalent to a sample size of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650. All three models (PLSR, Cubist and CNN) were generated for each sample size of the unique sites for the prediction of five different soil properties, i.e. cation exchange capacity, organic carbon, sand, silt and clay content. These calibration subset sampling processes and modelling were repeated 10 times to provide a better representation of the model performances. Learning curves showed that the accuracy increased with an increasing number of training samples. At a lower number of samples (< 1000), PLSR and Cubist performed better than CNN. The performance of CNN outweighed the PLSR and Cubist model at a sample size of 1500 and 1800, respectively. It can be recommended that deep learning is most efficient for spectra modelling for sample sizes above 2000. The accuracy of the PLSR and Cubist model seems to reach a plateau above sample sizes of 4200 and 5000, respectively, while the accuracy of CNN has not plateaued. A sensitivity analysis of the CNN model demonstrated its ability to determine important wavelengths region that affected the predictions of various soil attributes.


2020 ◽  
Author(s):  
Wasim Ahmed ◽  
Francesc López Seguí ◽  
Josep Vidal-Alaball ◽  
Matthew S Katz

BACKGROUND During the COVID-19 pandemic, a number of conspiracy theories have emerged. A popular theory posits that the pandemic is a hoax and suggests that certain hospitals are “empty.” Research has shown that accepting conspiracy theories increases the likelihood that an individual may ignore government advice about social distancing and other public health interventions. Due to the possibility of a second wave and future pandemics, it is important to gain an understanding of the drivers of misinformation and strategies to mitigate it. OBJECTIVE This study set out to evaluate the #FilmYourHospital conspiracy theory on Twitter, attempting to understand the drivers behind it. More specifically, the objectives were to determine which online sources of information were used as evidence to support the theory, the ratio of automated to organic accounts in the network, and what lessons can be learned to mitigate the spread of such a conspiracy theory in the future. METHODS Twitter data related to the #FilmYourHospital hashtag were retrieved and analyzed using social network analysis across a 7-day period from April 13-20, 2020. The data set consisted of 22,785 tweets and 11,333 Twitter users. The Botometer tool was used to identify accounts with a higher probability of being bots. RESULTS The most important drivers of the conspiracy theory are ordinary citizens; one of the most influential accounts is a Brexit supporter. We found that YouTube was the information source most linked to by users. The most retweeted post belonged to a verified Twitter user, indicating that the user may have had more influence on the platform. There was a small number of automated accounts (bots) and deleted accounts within the network. CONCLUSIONS Hashtags using and sharing conspiracy theories can be targeted in an effort to delegitimize content containing misinformation. Social media organizations need to bolster their efforts to label or remove content that contains misinformation. 
Public health authorities could enlist the assistance of influencers in spreading antinarrative content.

