Utilizing Twitter Data Analysis and Deep Learning to Identify Drug Use (Preprint)

2019 ◽  
Author(s):  
Joseph Tassone ◽  
Peizhi Yan ◽  
Mackenzie Simpson ◽  
Chetan Mendhe ◽  
Vijay Mago ◽  
...  

BACKGROUND The collection and examination of social media has become a useful mechanism for studying the mental activity and behavior tendencies of users. OBJECTIVE Through the analysis of a collected set of Twitter data, a model will be developed for predicting positively referenced, drug-related tweets. From this, trends and correlations can be determined. METHODS Tweets and their attribute data were collected from Twitter and processed using topic-pertaining keywords, such as drug slang and use-conditions (methods of drug consumption). Potential candidates were preprocessed, resulting in a dataset of 3,696,150 rows. The predictive classification power of multiple methods was compared, including regression, decision trees, and CNN-based classifiers. For the latter, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets. RESULTS The logistic regression and decision tree models utilized 12,142 data points for training and 1,041 data points for testing. The logistic regression models displayed accuracies of 54.56% and 57.44%, and an AUC of 0.58. While an improvement, the decision tree achieved an accuracy of only 63.40% and an AUC of 0.68. All these values implied a low predictive capability with little to no discrimination. Conversely, the two CNN-based classifiers tested presented a substantial improvement. The first was trained with 2,661 manually labeled samples, while the other included synthetically generated tweets, culminating in 12,142 samples. The accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91. Using association rule mining in conjunction with the CNN-based classifier showed a high likelihood for keywords such as “smoke”, “cocaine”, and “marijuana” triggering a drug-positive classification. CONCLUSIONS Predictive analysis without a CNN is limited and possibly fruitless. Attribute-based models presented little predictive capability and were not suitable for analyzing this type of data. The semantic meaning of the tweets needed to be utilized, giving the CNN-based classifier an advantage over other solutions. Additionally, commonly mentioned drugs had a level of correspondence with frequently used illicit substances, proving the practical usefulness of this system. Lastly, the synthetically generated set provided increased scores, improving the predictive capability. CLINICALTRIAL None
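
The abstract above reports both accuracy and AUC for each classifier. As a rough illustration of how the two metrics are computed and why they can differ, here is a minimal Python sketch. The function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration; none of them come from the study.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive score and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, so a model can score modestly on one and well on the other.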

2020 ◽  
Vol 20 (S11) ◽  
Author(s):  
Joseph Tassone ◽  
Peizhi Yan ◽  
Mackenzie Simpson ◽  
Chetan Mendhe ◽  
Vijay Mago ◽  
...  

Abstract Background The collection and examination of social media has become a useful mechanism for studying the mental activity and behavior tendencies of users. Through the analysis of a collected set of Twitter data, a model will be developed for predicting positively referenced, drug-related tweets. From this, trends and correlations can be determined. Methods Social media data (tweets and attributes) were collected and processed using topic-pertaining keywords, such as drug slang and use-conditions (methods of drug consumption). Potential candidates were preprocessed, resulting in a dataset of 3,696,150 rows. The predictive classification power of multiple methods was compared, including SVM, XGBoost, BERT and CNN-based classifiers. For the latter, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets. Results To test the predictive capability of the model, SVM and XGBoost were first employed. These models displayed accuracies of 59.33% and 54.90%, with AUCs of 0.87 and 0.71, respectively. The values show a low predictive capability with little discrimination. Conversely, the two CNN-based classifiers tested presented a significant improvement. The first was trained with 2,661 manually labeled samples, while the other included synthetically generated tweets, culminating in 12,142 samples. The accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91. Using association rule mining in conjunction with the CNN-based classifier showed a high likelihood for keywords such as “smoke”, “cocaine”, and “marijuana” triggering a drug-positive classification. Conclusion Predictive analysis with a CNN is promising, whereas attribute-based models presented little predictive capability and were not suitable for analyzing text data. This research found that the commonly mentioned drugs had a level of correspondence with frequently used illicit substances, proving the practical usefulness of this system. Lastly, the synthetically generated set provided increased accuracy scores and improved the predictive capability.
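
The abstract mentions association rule mining linking keywords to a drug-positive classification. The sketch below shows, under stated assumptions, how the standard support, confidence, and lift of a rule such as "smoke -> drug-positive" could be computed. The record format, the `DRUG_POSITIVE` label item, and the toy records are hypothetical and are not the authors' data or pipeline.

```python
def rule_metrics(records, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.

    Each record is a set of items (here: keywords plus a class-label item).
    """
    total = len(records)
    with_antecedent = sum(1 for r in records if antecedent in r)
    with_consequent = sum(1 for r in records if consequent in r)
    with_both = sum(1 for r in records if antecedent in r and consequent in r)
    if total == 0 or with_antecedent == 0 or with_consequent == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = with_both / total                 # P(antecedent and consequent)
    confidence = with_both / with_antecedent    # P(consequent | antecedent)
    lift = confidence / (with_consequent / total)
    return support, confidence, lift


# Hypothetical toy records: keywords found in a tweet plus the classifier's label.
toy_records = [
    {"smoke", "DRUG_POSITIVE"},
    {"smoke", "party", "DRUG_POSITIVE"},
    {"smoke", "alarm"},
    {"coffee", "morning"},
    {"party", "DRUG_POSITIVE"},
]
support, confidence, lift = rule_metrics(toy_records, "smoke", "DRUG_POSITIVE")
```

A lift above 1 indicates that the keyword co-occurs with the drug-positive label more often than independence would predict.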


Author(s):  
Prof. Manisha Sachin Dabade, et al.

In today’s world, social media is ubiquitous and easily accessible. Social media sites such as Twitter, Facebook, and Tumblr are a primary and valuable source of information. Twitter is a micro-blogging platform that provides an enormous amount of data. This information can be used for different sentiment analysis applications such as reviews, predictions, elections, and marketing. It is one of the most popular sites, where people write tweets, retweet, and interact daily. Monitoring and analyzing these tweets gives valuable feedback to users. Because of the data's large size, sentiment analysis is used to analyze it without going through millions of tweets manually. Users write their reviews about different products, topics, or events on Twitter in the form of tweets and retweets. People also use emojis (happy, sad, neutral) to express their emotions, so these sites contain expansive volumes of unprocessed, raw data. The main goal of this research is to identify suitable algorithms among machine learning classifiers. The study intends to categorize fine-grained sentiments within tweets about vaccination (89,974 tweets) using machine learning and deep learning approaches. The study considers both labeled and unlabeled data. It also detects emojis in tweets using libraries such as TextBlob, VADER, fastText, Flair, Gensim, spaCy, and NLTK.


Sebatik ◽  
2020 ◽  
Vol 24 (2) ◽  
Author(s):  
Anifuddin Azis

Indonesia is the country with the second-largest biodiversity in the world after Brazil. Indonesia has about 25,000 plant species and 400,000 species of animals and fish. An estimated 8,500 fish species live in Indonesian waters, or 45% of the species in the world, of which about 7,000 are marine fish species. Determining the number of these species requires expertise in taxonomy. In practice, identifying a fish species is not easy because it requires particular methods and equipment, as well as taxonomic literature. Automated processing of video or images of aquatic ecosystem data has begun to be developed. In this development, detecting and identifying fish species is a challenge compared with detecting and identifying other objects. Deep learning methods, which have succeeded at classifying objects in images, can analyze data directly without a dedicated feature extraction step. Such a system has parameters, or weights, that serve both as feature extractor and as classifier. The processed data produce an output that is expected to be as close as possible to the true output data. CNN is a deep learning architecture that can reduce the dimensionality of data without losing the characteristics or features of that data. This study develops a hybrid model that uses a CNN (Convolutional Neural Network) to extract features and several classification algorithms to identify fish species. The classification algorithms used in this study are: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree, K-Nearest Neighbor (KNN), Random Forest, and Backpropagation.


2019 ◽  
Vol 11 (01n02) ◽  
pp. 1950002
Author(s):  
Rasim M. Alguliyev ◽  
Ramiz M. Aliguliyev ◽  
Fargana J. Abdullayeva

Recently, data collected from social media have made it possible to analyze social events and make predictions about real events, based on the analysis of users' sentiments and opinions. Most cyber-attacks are carried out by hackers on the basis of discussions on social media. This paper proposes a method that predicts the occurrence of DDoS attacks by finding relevant texts in social media. To perform high-precision classification of texts into positive and negative classes, a CNN model with 13 layers and an improved LSTM method are used. To predict the occurrence of DDoS attacks on the next day, the negative and positive sentiments in social networking texts are used. To evaluate the efficiency of the proposed method, experiments were conducted on Twitter data. The proposed method achieved a recall, precision, F-measure, training loss, training accuracy, testing loss, and test accuracy of 0.85, 0.89, 0.87, 0.09, 0.78, 0.13, and 0.77, respectively.
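
The recall and precision reported above can be checked against the third reported measure, assuming that measure is the balanced F1 score (the harmonic mean of precision and recall). The sketch below uses the general F-beta formula; the function name and the default `beta=1.0` are assumptions for illustration, not details taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Values reported in the abstract.
reported_precision = 0.89
reported_recall = 0.85
f1 = f_measure(reported_precision, reported_recall)
```

Rounded to two decimals, `f1` agrees with the 0.87 reported in the abstract, which is consistent with the F1 interpretation.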


2021 ◽  
Vol 17 (3) ◽  
pp. 62-74
Author(s):  
Lydia Jane G. ◽  
Seetha Hari

As social media platforms are increasingly used across the world, there are many prospects for using their data for prediction and analysis. On Twitter, there are discussions about events, interests, and many other topics, and all these discussions are publicly available. This makes Twitter a valuable source of data to augment decision support systems. In this paper, the use of GPS-tagged tweets for crime prediction is researched. Twitter data is collected from Chicago and cleaned, and topic modelling is applied to the resultant set. Before topic modelling, an algorithm developed for this purpose identifies tweets that are relevant to the crime prediction problem. Once the relevant tweets are identified, topic modelling is applied to find the major crimes in the different beats of Chicago. Kernel density estimation (KDE) is applied to traditional data. The result of this and of topic modelling are used to predict the crime count for each beat using logistic regression.
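
The abstract applies kernel density estimation (KDE) to traditional data. The sketch below illustrates the general technique with an isotropic Gaussian kernel over 2D points. The kernel choice, the bandwidth value, the function name, and the toy coordinates are all assumptions for illustration; the paper's actual KDE configuration is not given in the abstract.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a density function built from 2D points with an isotropic Gaussian kernel."""
    if not points:
        raise ValueError("at least one point is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Normalizing constant: 1 / (n * 2 * pi * h^2) for a 2D Gaussian kernel.
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            squared_distance = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
        return norm * total

    return density


# Hypothetical incident coordinates (arbitrary units), not real crime data.
toy_incidents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
density = gaussian_kde_2d(toy_incidents, bandwidth=0.5)
near_cluster = density(0.05, 0.05)
near_single_point = density(2.0, 2.0)
far_away = density(5.0, 5.0)
```

The estimated density is highest near the cluster of three incidents, lower near the isolated one, and close to zero far from all of them. A density value per region could then serve as one input feature to a downstream model.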


2021 ◽  
Author(s):  
Takuma Shibahara ◽  
Chisa Wada ◽  
Yasuho Yamashita ◽  
Kazuhiro Fujita ◽  
Masamichi Sato ◽  
...  

Abstract Breast cancer is the most frequently found cancer in women and the one most often subjected to genetic analysis. Nonetheless, it has been causing the largest number of women's cancer-related deaths. PAM50, the intrinsic subtype assay for breast cancer, is beneficial for diagnosis but does not explain each subtype’s mechanism. Deep learning can predict the subtypes from genetic information more accurately than conventional statistical methods. However, the previous studies did not directly use deep learning to examine which genes associate with the subtypes. To reveal the mechanisms embedded in the PAM50 subtypes, we developed an explainable deep learning model called a point-wise linear model, which uses meta-learning to generate a custom-made logistic regression for each sample. Logistic regression is familiar to physicians, and we can use it to analyze which genes are important for prediction. The custom-made logistic regression models generated by the point-wise linear model used the specific genes selected in other subtypes compared to the conventional logistic regression model: the overlap ratio is less than twenty percent. Analyzing the point-wise linear model’s inner state, we found that the point-wise linear model used genes relevant to the cell cycle-related pathways.


2021 ◽  
Vol 9 (1) ◽  
pp. 1315-1320
Author(s):  
Dr. Mohammed Ali Alhariri

This work detects duplicate fake accounts using data accessed from a social media platform; Twitter was selected as the platform for the analysis. The Twitter data is accessed through the Twitter API, using a set of selected features considered most relevant to duplicate fake accounts. The feature-based analysis is compared across three machine learning techniques: Random Forest, Decision Tree, and SVM. Performance is evaluated by accuracy: SVM achieved 93.3%, the decision tree 89.0%, and the random forest 85.5%. SVM therefore gave the best performance in the feature-based analysis.


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Huong T. T. Pham ◽  
Hoa Pham

Abstract Existence conditions for the posterior mean in Bayesian logistic regression depend on both the chosen prior distributions and the likelihood function. In logistic regression, different patterns of data points can lead to finite or infinite maximum likelihood estimates (MLE) of the regression coefficients. Albert and Anderson [On the existence of maximum likelihood estimates in logistic regression models, Biometrika 71 1984, 1, 1–10] gave definitions of different types of data points, which are complete separation, quasicomplete separation and overlap. Conditions for the existence of the MLE for logistic regression models were proposed under different types of data points. Based on these conditions, we propose necessary and sufficient conditions for the existence of the posterior mean under different choices of prior distributions. In this paper, a wide class of priors, comprising informative and non-informative priors with proper and improper distributions, is considered for the existence of the posterior mean. In addition, necessary and sufficient conditions for the existence of the posterior mean of an individual coefficient are also proposed.
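
To make the three data configurations named above concrete, here is a minimal sketch for the simplest case: a single predictor with an intercept. It classifies a dataset as complete separation, quasicomplete separation, or overlap, and evaluates the log-likelihood to show that under complete separation it keeps increasing as the slope grows, so no finite MLE exists. The function names and toy data are assumptions for illustration; the general multivariate definitions are those of Albert and Anderson.

```python
import math


def separation_type(xs, ys):
    """Classify one-predictor binary data (model with intercept) by separation type."""
    if len(set(xs)) < 2:
        raise ValueError("predictor must take at least two distinct values")
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    if not x0 or not x1:
        raise ValueError("both classes must be present")
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"
    if max(x0) == min(x1) or max(x1) == min(x0):
        return "quasicomplete"
    return "overlap"


def log_likelihood(xs, ys, intercept, slope):
    """Logistic regression log-likelihood, computed in a numerically stable way."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = intercept + slope * x
        signed = z if y == 1 else -z  # log-probability of the observed label is log sigmoid(signed)
        if signed >= 0:
            total += -math.log1p(math.exp(-signed))
        else:
            total += signed - math.log1p(math.exp(signed))
    return total


# Hypothetical toy data with complete separation at x = 0.
toy_x = [-2.0, -1.0, 1.0, 2.0]
toy_y = [0, 0, 1, 1]
kind = separation_type(toy_x, toy_y)
likelihoods = [log_likelihood(toy_x, toy_y, 0.0, slope) for slope in (1.0, 10.0, 100.0)]
```

In this toy example the log-likelihood approaches its supremum of zero only as the slope tends to infinity, which is the infinite-MLE situation in which the choice of prior determines whether a posterior mean exists.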

