ClickbaitTR: Dataset for clickbait detection from Turkish news sites and social media with a comparative analysis via machine learning algorithms

2021 ◽  
pp. 016555152110077
Author(s):  
Şura Genç ◽  
Elif Surer

Clickbait is a strategy that aims to attract people’s attention and direct them to specific content. Clickbait titles, which rely on information that is not included in the main content or on intriguing expressions with various text-related features, have become very popular, especially on social media. This study expands the Turkish clickbait dataset that we had constructed for clickbait detection in our proof-of-concept study, written in Turkish. We reach a sample size of 48,060 by adding 8,859 tweets and release a publicly available dataset – ClickbaitTR – with its open-source data analysis library. We apply machine learning algorithms such as Artificial Neural Network (ANN), Logistic Regression, Random Forest, Long Short-Term Memory Network (LSTM), Bidirectional Long Short-Term Memory (BiLSTM) and Ensemble Classifier to 48,060 news headlines extracted from Twitter. The results show accuracies of 85% for Logistic Regression, 86% for Random Forest, 93% each for the LSTM, the ANN and the Ensemble Classifier, and 97% for the BiLSTM. A thorough discussion of the psychological aspects of the clickbait strategy is provided, focusing on curiosity and interest arousal. In addition to successful clickbait detection performance and a detailed analysis of clickbait sentences in terms of language and psychological aspects, this study also contributes to clickbait detection studies with the largest clickbait dataset in Turkish.

2022 ◽  
Vol 2161 (1) ◽  
pp. 012055
Author(s):  
H O Lekshmy ◽  
Dhanyalaxmi Panickar ◽  
Sandhya Harikumar

Abstract Epilepsy is a common neurological disease that affects more than 2 percent of the population globally. An imbalance in the brain's electrical activity causes unpredictable seizures, which eventually lead to epilepsy. Neurostimulators have the power to intervene in advance and avoid the occurrence of seizures. Their efficiency can be increased with the help of heuristics like advanced seizure prediction. Early identification of the preictal state enables timely activation of the neurostimulator. This research concentrates on the performance analysis of various machine learning algorithms on recorded EEG data. Through this study, we aim to find the best model, which can be used to create an ensemble model for better learning. This involves modeling and simulation of classical machine learning techniques such as Logistic Regression, Naive Bayes, K-Nearest Neighbors and Random Forest, and deep learning techniques such as Artificial Neural Networks, Convolutional Neural Networks, Long Short-Term Memory and Autoencoders. In this analysis, Random Forest and Long Short-Term Memory performed well among all models in terms of sensitivity and specificity.


Author(s):  
Dyapa Sravan Reddy ◽  
Lakshmi Prasanna Reddy ◽  
Kandibanda Sai Santhosh ◽  
Virrat Devaser

SEO analysts spend a lot of time finding relevant tags for their articles, and in some cases they are unfamiliar with the content topics. The proposed ML model recommends content-related tags so that content writers and SEO analysts get an overview of the content and spend less time on unfamiliar articles. Machine learning algorithms have a plethora of applications, and the extent of their real-life implementations cannot be estimated. Using algorithms such as One vs Rest (OVR) and Long Short-Term Memory (LSTM), this study analyzes how machine learning can be useful for suggesting tags for a topic. Training the model with One vs Rest delivered more accurate results than the other approaches. This study shows how One vs Rest can be used to suggest the tags needed to promote a website; further studies are required to suggest keywords.


2021 ◽  
Author(s):  
Hyeon Kang ◽  
Kyung Won Park ◽  
Do-Young Kang

Abstract A single amyloid-beta (Aβ) imaging test is not enough to meet the challenge of diagnosing AD because of Aβ-negative AD and Aβ-positive cognitively normal (CN) cases. We aimed to distinguish AD from CN with dual-phase 18F-Florbetaben (FBB) via machine learning algorithms and to evaluate the AD positivity scores compared to delay-phase FBB (dFBB), which is currently adopted for AD diagnosis. A total of 264 patients (74 CN and 190 AD) who underwent an FBB imaging test and neuropsychological tests were retrospectively analyzed. We compared three kinds of machine learning-based models and evaluated their performance with 4-fold cross-validation. AD positivity scores estimated from dual-phase FBB showed better accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for AD detection (ACC: 84.091%, AUROC: 0.900) than those from dFBB imaging (ACC: 81.364%, AUROC: 0.890). When the association between predicted AD positivity and AD occurrence was compared, it was highest for dual-phase FBB (OR: 56.333), followed by dFBB (OR: 35.182). These results show that the combined model, which interprets dual-phase FBB with long short-term memory, can provide a more accurate AD positivity score, which shows a closer association with AD, than prediction with single-phase FBB only.
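
The ACC and OR figures quoted above can be related to a 2x2 table of predictions against diagnoses. The following is a minimal sketch, assuming that OR denotes an odds ratio computed from such a table; the abstract does not state how the study derived it, and the counts below are hypothetical and are not the study's data.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def odds_ratio(tp, fp, fn, tn):
    """Odds of a positive prediction among AD cases divided by the
    odds of a positive prediction among CN cases: (tp/fn) / (fp/tn)."""
    if fp == 0 or fn == 0:
        raise ValueError("odds ratio is undefined when fp or fn is zero")
    return (tp * tn) / (fp * fn)


# Hypothetical counts for illustration only (not from the study):
# tp = AD predicted positive, fn = AD predicted negative,
# fp = CN predicted positive, tn = CN predicted negative.
tp, fp, fn, tn = 80, 10, 20, 40

acc = accuracy(tp, fp, fn, tn)
ratio = odds_ratio(tp, fp, fn, tn)
print(f"ACC = {acc:.3f}, OR = {ratio:.1f}")
```

A larger odds ratio means that a positive score is more strongly tied to the AD group, which is the sense in which the abstract ranks dual-phase FBB above dFBB.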


2019 ◽  
Author(s):  
Suleka Helmini ◽  
Nadheesh Jihan ◽  
Malith Jayasinghe ◽  
Srinath Perera

In the retail domain, estimating sales before actual sales become known plays a key role in maintaining a successful business, because most crucial decisions are bound to be based on these forecasts. Statistical sales forecasting models like ARIMA (Auto-Regressive Integrated Moving Average) are among the most traditional and commonly used forecasting methodologies. Even though these models are capable of producing satisfactory forecasts for linear time series data, they are not suitable for analyzing non-linear data. Therefore, machine learning models (such as Random Forest Regression and XGBoost) have been employed frequently, as they achieve better results on non-linear data. Recent research shows that deep learning models (e.g. recurrent neural networks) can provide higher prediction accuracy than machine learning models due to their ability to persist information and identify temporal relationships. In this paper, we adopt a special variant of the Long Short-Term Memory (LSTM) network, the LSTM model with peephole connections, for sales prediction. We first build our model using historical features for sales forecasting. We compare the results of this initial LSTM model with multiple machine learning models, namely the Extreme Gradient Boosting model (XGB) and the Random Forest Regressor model (RFR). We further improve the prediction accuracy of the initial model by incorporating features that describe the future that is known to us at the current moment, an approach that has not been explored in previous state-of-the-art LSTM-based forecasting models. The initial LSTM model we develop outperforms the machine learning models, achieving a 12% to 14% improvement, whereas the improved LSTM model achieves an 11% to 13% improvement compared to the improved machine learning models. Furthermore, we show that our improved LSTM model obtains a 20% to 21% improvement compared to the initial LSTM model.
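
For readers unfamiliar with peephole connections, one common textbook formulation of the peephole LSTM cell is given below. The abstract does not give equations, so the notation here is an assumption: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $p_i$, $p_f$, $p_o$ the peephole weight vectors that let each gate read the cell state directly. The authors' exact variant may differ.

```latex
\begin{align}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

Setting $p_i = p_f = p_o = 0$ recovers the standard LSTM cell, so the peephole terms are the only difference between the two.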


2020 ◽  
Vol 12 (11) ◽  
pp. 4471 ◽  
Author(s):  
Jack Ngarambe ◽  
Amina Irakoze ◽  
Geun Young Yun ◽  
Gon Kim

The performance of machine learning (ML) algorithms depends on the nature of the problem at hand. ML-based modeling, therefore, should employ suitable algorithms where optimum results are desired. The purpose of the current study was to explore the potential applications of ML algorithms in modeling daylight in indoor spaces and ultimately identify the optimum algorithm. We thus developed and compared the performance of four common ML algorithms (generalized linear models, deep neural networks, random forest, and gradient boosting models) in predicting the distribution of indoor daylight illuminances. We found that deep neural networks, which showed a coefficient of determination (R²) of 0.99, outperformed the other algorithms. Additionally, we explored the use of long short-term memory to forecast the distribution of daylight at a particular future time. Our results show that long short-term memory is accurate and reliable (R² = 0.92). Our findings provide a basis for discussions on the use of ML algorithms in modeling daylight in indoor spaces, which may ultimately result in efficient tools for estimating daylight performance in the primary stages of building design and daylight control schemes for energy efficiency.
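
The R2 values reported above can be read against the usual definition of the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition; the abstract does not say which variant the authors used, and the illuminance values are hypothetical rather than taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when observed values are constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical illuminance values in lux (illustration only).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

score = r_squared(observed, predicted)
print(f"R^2 = {score:.3f}")
```

A score of 1 means the predictions match the observations exactly, while a score near 0 means the model does no better than predicting the mean.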


2021 ◽  
Vol 7 ◽  
pp. e645
Author(s):  
Ramish Jamil ◽  
Imran Ashraf ◽  
Furqan Rustam ◽  
Eysha Saad ◽  
Arif Mehmood ◽  
...  

Sarcasm emerges as a common phenomenon across social networking sites because people express their negative thoughts, hatred and opinions using positive vocabulary, which makes sarcasm challenging to detect. Although various studies have investigated sarcasm detection on baseline datasets, this work is the first to detect sarcasm in a multi-domain dataset constructed by combining Twitter and News Headlines datasets. This study proposes a hybrid approach where convolutional neural networks (CNN) are used for feature extraction while long short-term memory (LSTM) is trained and tested on those features. For performance analysis, several machine learning algorithms such as random forest, support vector classifier, extra tree classifier and decision tree are used. The performance of both the proposed model and the machine learning algorithms is analyzed using term frequency-inverse document frequency (TF-IDF), bag-of-words, and global vectors for word representation (GloVe) features. Experimental results indicate that the proposed model surpasses the performance of the traditional machine learning algorithms with an accuracy of 91.60%. Several state-of-the-art approaches for sarcasm detection are compared with the proposed model, and the results suggest that the proposed model outperforms these approaches in terms of precision, recall and F1 scores. The proposed model is accurate, robust, and performs sarcasm detection on a multi-domain dataset.


Abstract. Predictive models are important to help manage high-value assets and to ensure optimal and safe operations. Recently, advanced machine learning algorithms have been applied to solve practical and complex problems, and are of significant interest due to their ability to adaptively ‘learn’ in response to changing environments. This paper reports on the data preparation strategies and the development and predictive capability of a Long Short-Term Memory recurrent neural network model for anaerobic reactors employed at Melbourne Water’s Western Treatment Plant for sewage treatment that includes biogas harvesting. The results show rapid training and higher accuracy in predicting biogas production when historical data, which include significant outliers, are preprocessed with z-score standardisation in comparison to those with max-min normalisation. Furthermore, a trained model with a reduced number of input variables via the feature selection technique based on Pearson’s correlation coefficient is found to yield good performance given sufficient dataset training. It is shown that the overall best performance model comprises the reduced input variables and data processed with z-score standardisation. This initial study provides a useful guide for the implementation of machine learning techniques to develop smarter structures and management towards Industry 4.0 concepts.
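
The two preprocessing schemes compared above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names and the sample readings are hypothetical, and the population standard deviation is an arbitrary choice for the z-score.

```python
from statistics import mean, pstdev


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for constant data")
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for constant data")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sensor readings; the last one is an outlier.
readings = [10.0, 12.0, 11.0, 13.0, 100.0]

scaled_mm = min_max(readings)
scaled_z = z_score(readings)
print([round(v, 3) for v in scaled_mm])
print([round(v, 3) for v in scaled_z])
```

With min-max scaling the single outlier fixes the upper end of the range, so the four typical readings are squeezed into a narrow band near 0. The z-score output is not bounded in this way, although its mean and standard deviation are also affected by the outlier. This is one possible intuition for the reported difference; the abstract itself does not state the reason.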

