Collecting a Large Scale Dataset for Classifying Fake News Tweets Using Weak Supervision

2021 ◽  
Vol 13 (5) ◽  
pp. 114
Author(s):  
Stefan Helmstetter ◽  
Heiko Paulheim

The problem of automatically detecting fake news on social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is collecting large enough training corpora, since manually annotating tweets as fake or non-fake news is an expensive and tedious endeavor, and recent approaches utilizing distributional semantics require large training corpora. In this paper, we introduce an alternative approach for creating a large-scale dataset for tweet classification with minimal user intervention. The approach relies on weak supervision and automatically collects a large-scale, but very noisy, training dataset comprising hundreds of thousands of tweets. As a weak supervision signal, we label tweets by their source, i.e., trustworthy or untrustworthy, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., distinguishing fake from non-fake tweets. Although the labels are inaccurate with respect to the new classification target (not every tweet from an untrustworthy source is fake news, and vice versa), we show that despite this unclean, inaccurate dataset, the results are comparable to those achieved using a manually labeled set of tweets. Moreover, we show that combining the large-scale noisy dataset with a human-labeled one yields better results than either of the two alone.
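The core idea of the abstract can be sketched in a few lines: train on noisy source-based labels, then apply the classifier to the fake/non-fake target. This is an illustrative toy, not the authors' pipeline; the tweets, labels, and TF-IDF/logistic-regression model are all stand-in assumptions.

```python
# Minimal sketch of source-based weak supervision (toy data; the real
# dataset has hundreds of thousands of tweets and a stronger model).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tweets labeled only by their *source*, not by manual fact-checking.
tweets = ["official report confirms figures", "shocking secret they hide",
          "new study published today", "you won't believe this cure"]
source_is_trustworthy = [1, 0, 1, 0]  # weak, noisy labels

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, source_is_trustworthy)

# The same classifier is then applied to the *different* target:
# classifying individual tweets as fake (0) or non-fake (1) news.
pred = clf.predict(["shocking secret cure they hide"])
```

The weak labels are wrong for some individual tweets, but at scale the source signal correlates strongly enough with the fake/non-fake target for the classifier to transfer.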

2021 ◽  
Vol 13 (5) ◽  
pp. 905
Author(s):  
Chuyi Wu ◽  
Feng Zhang ◽  
Junshi Xia ◽  
Yichen Xu ◽  
Guoqing Li ◽  
...  

Building damage status is vital for planning rescue and reconstruction after a disaster, yet it is hard to detect and to grade in severity. Most existing studies focus on binary classification, and the model's attention is easily distracted. In this study, we propose a Siamese neural network that can localize and classify damaged buildings in one pass. The main components of this network are attention U-Nets with various backbones. The attention mechanism enables the network to focus on effective features and channels, reducing the impact of useless features. We train the networks on the xBD dataset, a large-scale dataset for the advancement of building damage assessment, and compare their balanced F (F1) scores. SEResNeXt with an attention mechanism gives the best performance, with an F1 score of 0.787. To improve accuracy further, we fused the results and obtained a best overall F1 score of 0.792. To verify the transferability and robustness of the model, we evaluated it on imagery of two recent disasters from the Maxar Open Data Program. Visual comparison shows that our model is robust and transferable.
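The channel-attention idea behind the SEResNeXt backbone is squeeze-and-excitation: pool each channel to a scalar, pass the vector through a small bottleneck, and gate the channels with the result. A minimal NumPy sketch, with illustrative shapes and random weights rather than the paper's trained network:

```python
# Squeeze-and-excitation channel attention, as used in SE backbones
# (shapes and weights are illustrative, not taken from the paper).
import numpy as np

def se_attention(feat, w1, w2):
    # feat: (C, H, W) feature map
    z = feat.mean(axis=(1, 2))             # squeeze: global average pooling
    s = np.maximum(w1 @ z, 0)              # excitation: bottleneck FC + ReLU
    gate = 1 / (1 + np.exp(-(w2 @ s)))     # FC + sigmoid -> per-channel gate
    return feat * gate[:, None, None]      # reweight the channels

rng = np.random.default_rng(0)
C = 8
feat = rng.standard_normal((C, 16, 16))
out = se_attention(feat,
                   rng.standard_normal((C // 2, C)),   # squeeze to C/2
                   rng.standard_normal((C, C // 2)))   # expand back to C
```

The gate suppresses uninformative channels, which is the "pay more attention to effective features and channels" behavior the abstract describes.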


2018 ◽  
Vol 2 (334) ◽  
Author(s):  
Mirosław Krzyśko ◽  
Łukasz Smaga

In this paper, the binary classification problem for multi-dimensional functional data is considered. To solve this problem, a regression technique based on the functional logistic regression model is used. This model is re-expressed as a particular logistic regression model by using basis expansions of the functional coefficients and explanatory variables. Based on the re-expressed model, a classification rule is proposed. To handle outlying observations, robust methods for estimating the unknown parameters are also considered. Numerical experiments suggest that the proposed methods behave satisfactorily in practice.
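The basis-expansion step reduces each functional observation to a short vector of scalar coefficients, after which ordinary logistic regression applies. A sketch under simplifying assumptions (synthetic curves, a small Fourier basis, no robust estimation):

```python
# Functional logistic regression via basis expansion (synthetic data;
# the paper's robust estimators are not reproduced here).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)                    # evaluation grid
n = 60
y = rng.integers(0, 2, n)                     # binary class labels
# Two classes of curves differing in sine amplitude, plus noise.
curves = np.array([np.sin(2 * np.pi * t) * (1 + c)
                   + 0.1 * rng.standard_normal(t.size) for c in y])

# Basis expansion: project each curve onto a small Fourier basis.
K = 5
basis = np.array([np.ones_like(t)] +
                 [f(2 * np.pi * k * t)
                  for k in range(1, K) for f in (np.sin, np.cos)])
coeffs = curves @ basis.T / t.size            # (n, 2K-1) scalar features

# The re-expressed model is plain logistic regression on the coefficients.
clf = LogisticRegression().fit(coeffs, y)
```

Replacing the least-squares-style estimation with a robust estimator, as the paper does, changes only the fitting step, not the re-expression.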


Author(s):  
Dilip Kumar Sharma ◽  
Sonal Garg

Abstract: Spotting fake news is a critical problem nowadays. Social media are responsible for propagating fake news. Fake news propagated over digital platforms generates confusion and induces biased perspectives in people. Detecting misinformation on digital platforms is essential to mitigate its adverse impact. Many approaches have been implemented in recent years. Despite this productive work, fake news identification still poses many challenges due to the lack of a comprehensive publicly available benchmark dataset; in particular, there is no large-scale dataset consisting of Indian news only. So, this paper presents the IFND (Indian fake news dataset). The dataset consists of both text and images. The majority of the content covers events from 2013 to 2021. Dataset content is scraped using the Parsehub tool. To increase the amount of fake news in the dataset, an intelligent augmentation algorithm is used, which generates meaningful fake news statements. The latent Dirichlet allocation (LDA) technique is employed for topic modelling to assign categories to news statements. Various machine learning and deep learning classifiers are applied to the text and image modalities to assess the proposed IFND dataset's performance. A multi-modal approach is also proposed, which considers both textual and visual features for fake news detection. The proposed IFND dataset achieved satisfactory results. This study affirms that the accessibility of such a huge dataset can stimulate research in this laborious exploration issue and lead to better prediction models.
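The LDA category-assignment step described above can be sketched with scikit-learn on a toy corpus; the IFND categories and the Parsehub scraping are not reproduced, and the documents below are invented for illustration.

```python
# Assigning categories to news statements via LDA topic modelling
# (toy four-document corpus, two topics).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

news = ["election vote minister parliament",
        "cricket match team score victory",
        "vote parliament election result",
        "team cricket score innings"]

X = CountVectorizer().fit_transform(news)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each statement is assigned the topic (category) with highest probability.
topics = lda.transform(X).argmax(axis=1)
```

In the paper this category label then accompanies each statement through the classification experiments.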


Author(s):  
Siti Mariyam Shamsuddin ◽  
Anazida Zainal ◽  
Norfadzila Mohd Yusof

Clustering is the procedure of recognising classes of patterns that occur in the environment and assigning each pattern to its relevant class. Unlike classical statistical methods, the self-organising map (SOM) does not require any prior knowledge about the statistical distribution of the patterns in the environment. In this study, an alternative classification scheme for self-organising neural networks, known as multilevel learning, was proposed to solve the task of pattern separation. The performance of the standard SOM and the multilevel SOM was evaluated with different distance or dissimilarity measures for retrieving similarity between patterns. The purpose of this analysis was to evaluate the quality of the map produced by SOM learning under different distance measures in representing a given dataset. Based on the results obtained from both SOM methods, predictions can be made for unknown samples. The results showed that multilevel SOM learning gives a better classification rate for small- and medium-scale datasets, but not for large-scale datasets.
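A standard SOM training loop with a pluggable distance measure makes the abstract's comparison concrete: swapping `dist` changes which unit wins for each pattern, and hence the resulting map. This is a generic sketch, not the paper's multilevel variant; grid size, schedules, and data are assumptions.

```python
# Basic SOM with a pluggable dissimilarity measure (Euclidean by default).
import numpy as np

def train_som(data, grid=(4, 4), epochs=20, lr=0.5, dist=None, seed=0):
    rng = np.random.default_rng(seed)
    dist = dist or (lambda w, x: np.linalg.norm(w - x, axis=-1))
    w = rng.standard_normal((grid[0] * grid[1], data.shape[1]))
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for e in range(epochs):
        sigma = max(1.0, grid[0] / 2 * (1 - e / epochs))  # shrinking radius
        for x in data:
            bmu = np.argmin(dist(w, x))                   # best-matching unit
            d = np.linalg.norm(coords - coords[bmu], axis=1)
            h = np.exp(-d ** 2 / (2 * sigma ** 2))        # neighbourhood
            w += lr * (1 - e / epochs) * h[:, None] * (x - w)
    return w

data = np.random.default_rng(1).standard_normal((50, 3))
weights = train_som(data)
```

Passing, say, a Manhattan distance `lambda w, x: np.abs(w - x).sum(axis=-1)` reproduces the kind of measure-by-measure comparison the study performs.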


2020 ◽  
Vol 34 (05) ◽  
pp. 8838-8845
Author(s):  
Xiaoming Shi ◽  
Haifeng Hu ◽  
Wanxiang Che ◽  
Zhongqian Sun ◽  
Ting Liu ◽  
...  

In this work, we consider the medical slot filling problem, i.e., the challenging task of converting medical queries into structured representations. We analyze the effectiveness of two signals: scattered keywords in user utterances and weak supervision from responses. We approach medical slot filling as a multi-label classification problem with a label-embedding attentive model that pays more attention to scattered medical keywords, and we learn the classification models by weak supervision from responses. To evaluate the approach, we annotate a medical slot filling dataset and collect a large-scale unlabeled dataset. The experiments demonstrate that both signals improve the task.
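Casting slot filling as multi-label classification means each utterance maps to a binary vector over slots. A hedged sketch with toy utterances and invented slot names; the paper's label-embedding attention model is replaced here by a simple TF-IDF one-vs-rest baseline.

```python
# Medical slot filling as multi-label classification (toy data and slots).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

utterances = ["headache and fever since yesterday",
              "sharp chest pain when breathing",
              "fever with chest pain",
              "mild headache after work"]
slots = [{"symptom:headache", "symptom:fever"},
         {"symptom:chest_pain"},
         {"symptom:fever", "symptom:chest_pain"},
         {"symptom:headache"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(slots)            # utterances -> binary slot vectors
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression()))
clf.fit(utterances, Y)
pred = clf.predict(["fever and headache"])   # one row per query, one column per slot
```

The weak-supervision variant in the paper derives the `slots` targets from system responses instead of manual annotation; the classification machinery is unchanged.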


Author(s):  
M. A. Zurbaran ◽  
P. Wightman ◽  
M. A. Brovelli

Abstract. Satellite imagery from earth observation missions enables processing big data to gather information about the world. Automating the creation of maps that reflect ground truth is a desirable outcome that would help decision makers take adequate actions in alignment with the United Nations Sustainable Development Goals. In order to harness the power that the new generation of satellites enables, it is necessary to implement techniques capable of handling annotations for the massive volume and variability of high-spatial-resolution imagery for further processing. The availability of public datasets for training machine learning models for image segmentation therefore plays an important role for scalability.

This work focuses on bridging remote sensing and computer vision by providing an open-source pipeline for generating machine learning training datasets for road detection in an area of interest. The proposed pipeline addresses road detection as a binary classification problem, using road annotations existing in OpenStreetMap to create masks. For this case study, Planet images of 3 m resolution are used to create a training dataset for road detection in Kenya.
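The mask-creation step amounts to rasterising OpenStreetMap road centrelines into a binary image aligned with the satellite tile. Real pipelines would use tools such as rasterio or GDAL with proper georeferencing; this pure-NumPy version, with invented pixel-space polylines, only shows the principle.

```python
# Rasterise road polylines (pixel coordinates) into a binary training mask.
import numpy as np

def rasterize_roads(polylines, shape, width=1):
    mask = np.zeros(shape, dtype=np.uint8)
    for line in polylines:
        for (x0, y0), (x1, y1) in zip(line[:-1], line[1:]):
            n = max(abs(x1 - x0), abs(y1 - y0)) + 1   # samples along segment
            xs = np.linspace(x0, x1, n).round().astype(int)
            ys = np.linspace(y0, y1, n).round().astype(int)
            for x, y in zip(xs, ys):  # stamp a small square at each sample
                mask[max(0, y - width):y + width + 1,
                     max(0, x - width):x + width + 1] = 1
    return mask

roads = [[(2, 2), (20, 5)], [(0, 15), (24, 15)]]   # illustrative polylines
mask = rasterize_roads(roads, shape=(25, 25))      # 1 = road, 0 = background
```

Paired with the corresponding image tile, each such mask becomes one training example for the binary road / non-road segmentation model.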


2022 ◽  
Vol 2161 (1) ◽  
pp. 012074
Author(s):  
Hemavati ◽  
V Susheela Devi ◽  
R Aparna

Abstract: Nowadays, multi-label classification can be considered one of the important challenges in classification: each instance is assigned more than one class label. Ensemble learning is a supervised-learning process in which several classifiers are trained to obtain a better solution for a given problem. Feature reduction can improve classification accuracy by incorporating class label information into Principal Component Analysis (PCA). In this paper, a stacked ensemble learning method with class-information-augmented PCA (CA PCA) is proposed for the classification of multi-label data (SEMML). First, the dimensionality reduction step is applied; then a number of classifiers are chosen and applied to the training dataset; finally, the stacking method is applied. The results of the conducted experiments show that the proposed method performs better than existing methods.
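The three steps (reduce, train base classifiers, stack) compose naturally in scikit-learn. In this sketch, plain PCA stands in for the class-information-augmented CA-PCA, the base learners are arbitrary choices, and the multi-label aspect is handled per label via one-vs-rest; none of these specifics come from the paper.

```python
# PCA reduction + stacked ensemble for multi-label data (illustrative setup).
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, Y = make_multilabel_classification(n_samples=120, n_features=20,
                                      n_classes=4, n_labels=2,
                                      random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression())        # meta-learner
model = make_pipeline(PCA(n_components=10),      # stands in for CA PCA
                      OneVsRestClassifier(stack))
model.fit(X, Y)
```

Swapping the `PCA` step for a class-aware projection is the point of CA-PCA; the stacking scaffolding is unchanged.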


2020 ◽  
Vol 11 (8-2020) ◽  
pp. 176-178
Author(s):  
B.S. Darkhovsky ◽  
Y.A. Dubnov ◽  
A.Y. Popkov ◽  
...  

This work is devoted to a new model-free approach to the problem of binary classification of multivariate time series. The approach is based on the original theory of epsilon-complexity, which allows almost every mapping satisfying the Hölder condition to be characterized by a pair of real numbers, the complexity coefficients. Thus we can form a feature space in which a classification problem can be formulated and solved. We provide an example of classifying real EEG signals.
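One simplified reading of the construction: keep a fraction S of the samples, reconstruct the signal from them, record the approximation error eps(S), and fit log eps ≈ A + B log S; the pair (A, B) then serves as the 2-D feature for classification. The sketch below follows this reading with linear-interpolation reconstruction and a synthetic signal; it is an assumption-laden illustration, not the authors' exact procedure.

```python
# Estimating a pair of complexity coefficients (A, B) from a signal
# by a log-log fit of reconstruction error against retained fraction.
import numpy as np

def complexity_coefficients(x, fractions=(0.5, 0.25, 0.125, 0.0625)):
    t = np.arange(x.size)
    errs = []
    for S in fractions:
        step = int(round(1 / S))
        xr = np.interp(t, t[::step], x[::step])   # reconstruct from subgrid
        errs.append(np.abs(x - xr).mean() + 1e-12)
    B, A = np.polyfit(np.log(fractions), np.log(errs), 1)
    return A, B                                   # the 2-D feature vector

sig = np.sin(np.linspace(0, 8 * np.pi, 1000))
A, B = complexity_coefficients(sig)
```

For a multivariate series such as EEG, each channel (or the series as a whole) yields such a pair, and an ordinary classifier is trained in the resulting coefficient space.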


2021 ◽  
Vol 13 (7) ◽  
pp. 1404
Author(s):  
Hongying Liu ◽  
Derong Xu ◽  
Tianwen Zhu ◽  
Fanhua Shang ◽  
Yuanyuan Liu ◽  
...  

Classification of polarimetric synthetic aperture radar (PolSAR) images has achieved good results due to the excellent fitting ability of neural networks given a large number of training samples. However, the performance of most convolutional neural networks (CNNs) degrades dramatically when only a few labeled training samples are available. As one well-known class of semi-supervised learning methods, graph convolutional networks (GCNs) have recently gained much attention for addressing the classification problem with only a few labeled samples. As the number of layers in a network grows, the parameters increase dramatically, and it is challenging to determine an optimal architecture manually. In this paper, we propose a neural-architecture-search-based GCN (ASGCN) for the classification of PolSAR images. We construct a novel graph whose nodes combine both the physical features and the spatial relations between pixels or samples to represent the image. We then build a new search space whose components are empirically selected from graph neural networks, and develop a differentiable architecture search method to construct our ASGCN. Moreover, to address training on large-scale images, we present a new weighted mini-batch algorithm that reduces memory consumption and ensures a balanced sample distribution, and we analyze and compare it with similar training strategies. Experiments on several real-world PolSAR datasets show that our method improves the overall accuracy by as much as 3.76% over state-of-the-art methods.
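The weighted mini-batch idea (sample rare classes more often so each batch is roughly balanced) can be sketched generically; this shows the common inverse-frequency weighting scheme, not the paper's exact algorithm, and the labels are toy data.

```python
# Class-balanced weighted mini-batch sampling via inverse-frequency weights.
import numpy as np

def balanced_batches(labels, batch_size, n_batches, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    weights = 1.0 / counts[labels]     # rare classes sampled more often
    weights /= weights.sum()
    for _ in range(n_batches):
        # indices into the dataset; expected class mix is uniform
        yield rng.choice(labels.size, size=batch_size, p=weights)

labels = [0] * 90 + [1] * 10           # heavily imbalanced toy labels
batch = next(balanced_batches(labels, batch_size=1000, n_batches=1))
frac_minority = np.mean(np.asarray(labels)[batch] == 1)   # close to 0.5
```

Drawing fixed-size index batches this way also bounds per-step memory, which is the other property the paper's training strategy targets.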

