IFND: a benchmark dataset for fake news detection

Complex & Intelligent Systems ◽

10.1007/s40747-021-00552-1 ◽

2021 ◽

Author(s):

Dilip Kumar Sharma ◽

Sonal Garg

Keyword(s):

Large Scale ◽

Latent Dirichlet Allocation ◽

Prediction Models ◽

Benchmark Dataset ◽

Fake News ◽

Text And Image ◽

People Detection ◽

Digital Platforms ◽

Augmentation Algorithm ◽

Large Scale Dataset

AbstractSpotting fake news is a critical problem nowadays. Social media are responsible for propagating fake news. Fake news propagated over digital platforms generates confusion as well as induce biased perspectives in people. Detection of misinformation over the digital platform is essential to mitigate its adverse impact. Many approaches have been implemented in recent years. Despite the productive work, fake news identification poses many challenges due to the lack of a comprehensive publicly available benchmark dataset. There is no large-scale dataset that consists of Indian news only. So, this paper presents IFND (Indian fake news dataset) dataset. The dataset consists of both text and images. The majority of the content in the dataset is about events from the year 2013 to the year 2021. Dataset content is scrapped using the Parsehub tool. To increase the size of the fake news in the dataset, an intelligent augmentation algorithm is used. An intelligent augmentation algorithm generates meaningful fake news statements. The latent Dirichlet allocation (LDA) technique is employed for topic modelling to assign the categories to news statements. Various machine learning and deep-learning classifiers are implemented on text and image modality to observe the proposed IFND dataset's performance. A multi-modal approach is also proposed, which considers both textual and visual features for fake news detection. The proposed IFND dataset achieved satisfactory results. This study affirms that the accessibility of such a huge dataset can actuate research in this laborious exploration issue and lead to better prediction models.

Download Full-text

Collecting a Large Scale Dataset for Classifying Fake News Tweets Using Weak Supervision

Future Internet ◽

10.3390/fi13050114 ◽

2021 ◽

Vol 13 (5) ◽

pp. 114

Author(s):

Stefan Helmstetter ◽

Heiko Paulheim

Keyword(s):

Large Scale ◽

Binary Classification ◽

Classification Problem ◽

Training Dataset ◽

Fake News ◽

Weak Supervision ◽

Alternative Approach ◽

Large Scale Dataset ◽

Tweet Classification

The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straight-forward, binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor, and recent approaches utilizing distributional semantics require large training corpora. In this paper, we introduce an alternative approach for creating a large-scale dataset for tweet classification with minimal user intervention. The approach relies on weak supervision and automatically collects a large-scale, but very noisy, training dataset comprising hundreds of thousands of tweets. As a weak supervision signal, we label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this unclean, inaccurate dataset, the results are comparable to those achieved using a manually labeled set of tweets. Moreover, we show that the combination of the large-scale noisy dataset with a human labeled one yields more advantageous results than either of the two alone.

Download Full-text

FANG-COVID: A New Large-Scale Benchmark Dataset for Fake News Detection in German

10.18653/v1/2021.fever-1.9 ◽

2021 ◽

Author(s):

Justus Mattern ◽

Yu Qiao ◽

Elma Kerz ◽

Daniel Wiechmann ◽

Markus Strohmaier

Keyword(s):

Large Scale ◽

Benchmark Dataset ◽

Fake News

Download Full-text

Sentiment Analysis of Users’ Reviews on COVID-19 Contact Tracing Apps with a Benchmark Dataset (Preprint)

10.2196/preprints.28371 ◽

2021 ◽

Author(s):

Kashif Ahmad ◽

Firoj Alam ◽

Junaid Qadir ◽

Basheer Qolomony ◽

Imran Khan ◽

...

Keyword(s):

Sentiment Analysis ◽

Mobile Applications ◽

Large Scale ◽

Benchmark Dataset ◽

Contact Tracing ◽

Future Research ◽

Crowd Sourcing ◽

Large Scale Dataset ◽

And Performance ◽

Different Sources

BACKGROUND Contact tracing has been globally adopted in the fight to control the infection rate of COVID-19. Thanks to digital technologies, such as smartphones and wearable devices, contacts of COVID-19 patients can be easily traced and informed about their potential exposure to the virus. To this aim, several interesting mobile applications have been developed. However, there are ever-growing concerns over the working mechanism and performance of these applications. The literature already provides some interesting exploratory studies on the community’s response to the applications by analyzing information from different sources, such as news and users’ reviews of the applications. However, to the best of our knowledge, there is no existing solution that automatically analyzes users’ reviews and extracts the evoked sentiments. OBJECTIVE In this paper, we analyze how AI models can help in automatically extract and classify the polarity of users’ sentiments and propose a sentiment analysis framework to automatically analyze users’ reviews on COVID-19 contact tracing mobile applications. METHODS we propose a pipeline starting from manual annotation via a crowd-sourcing study and concluding on the development and training of AI models for automatic sentiment analysis of users’ reviews. In detail, we collected and annotated a large-scale dataset of Android and iOS mobile application users’ reviews for COVID-19 contact tracing. After manually analyzing and annotating users’ reviews, we employed both classical (i.e., Naïve Bayes, SVM, Random Forest) and deep learning (i.e., fastText, and different transformers) methods for classification experiments. This resulted in eight different classification models. RESULTS We employed eight different methods on three different tasks achieving up to an average F1-Scores 94.8% indicating the feasibility of automatic sentiment analysis of users’ reviews on the COVID-19 contact tracing applications. Moreover, the crowd-sourcing activity resulted in a large-scale benchmark dataset composed of 34,534 reviews manually annotated from the contract tracing applications of 46 distinct countries. CONCLUSIONS The existing literature mostly relies on the manual/exploratory analysis of users’ reviews on the application, which is a tedious and time-consuming process. Moreover, in the existing studies, generally, data from fewer applications are analyzed. In this work, we showed that automatic sentiment analysis can help in analyzing users’ responses to the application more quickly with significant accuracy. Moreover, we also provided a large-scale benchmark dataset composed of 34,534 reviews from 47 different applications. We believe the presented analysis and the dataset will support future research on the topic.

Download Full-text

Survey of Clustering Methods for Large Scale Dataset

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v7i5.13381344 ◽

2019 ◽

Vol 7 (5) ◽

pp. 1338-1344

Author(s):

Anupama Jawale ◽

Ganesh Magar

Keyword(s):

Large Scale ◽

Clustering Methods ◽

Large Scale Dataset

Download Full-text

Bioactivity Prediction Based on Matched Molecular Pair and Matched Molecular Series Methods

Current Pharmaceutical Design ◽

10.2174/1381612826666200427111309 ◽

2020 ◽

Vol 26 (33) ◽

pp. 4195-4205

Author(s):

Xiaoyu Ding ◽

Chen Cui ◽

Dingyan Wang ◽

Jihui Zhao ◽

Mingyue Zheng ◽

...

Keyword(s):

Prediction Model ◽

Large Scale ◽

Prediction Models ◽

Predictive Accuracy ◽

Lead Optimization ◽

Consensus Method ◽

Molecular Pair ◽

Bioactivity Prediction ◽

Compound Synthesis ◽

Consensus Modeling

Background: Enhancing a compound’s biological activity is the central task for lead optimization in small molecules drug discovery. However, it is laborious to perform many iterative rounds of compound synthesis and bioactivity tests. To address the issue, it is highly demanding to develop high quality in silico bioactivity prediction approaches, to prioritize such more active compound derivatives and reduce the trial-and-error process. Methods: Two kinds of bioactivity prediction models based on a large-scale structure-activity relationship (SAR) database were constructed. The first one is based on the similarity of substituents and realized by matched molecular pair analysis, including SA, SA_BR, SR, and SR_BR. The second one is based on SAR transferability and realized by matched molecular series analysis, including Single MMS pair, Full MMS series, and Multi single MMS pairs. Moreover, we also defined the application domain of models by using the distance-based threshold. Results: Among seven individual models, Multi single MMS pairs bioactivity prediction model showed the best performance (R2 = 0.828, MAE = 0.406, RMSE = 0.591), and the baseline model (SA) produced the most lower prediction accuracy (R2 = 0.798, MAE = 0.446, RMSE = 0.637). The predictive accuracy could further be improved by consensus modeling (R2 = 0.842, MAE = 0.397 and RMSE = 0.563). Conclusion: An accurate prediction model for bioactivity was built with a consensus method, which was superior to all individual models. Our model should be a valuable tool for lead optimization.

Download Full-text

Plant Disease Recognition: A Large-Scale Benchmark Dataset and a Visual Region and Loss Reweighting Approach

IEEE Transactions on Image Processing ◽

10.1109/tip.2021.3049334 ◽

2021 ◽

Vol 30 ◽

pp. 2003-2015

Author(s):

Xinda Liu ◽

Weiqing Min ◽

Shuhuan Mei ◽

Lili Wang ◽

Shuqiang Jiang

Keyword(s):

Large Scale ◽

Plant Disease ◽

Benchmark Dataset ◽

Visual Region

Download Full-text

Joint regression and learning from pairwise rankings for personalized image aesthetic assessment

Computational Visual Media ◽

10.1007/s41095-021-0207-y ◽

2021 ◽

Author(s):

Jin Zhou ◽

Qing Zhang ◽

Jian-Hao Fan ◽

Wei Sun ◽

Wei-Shi Zheng

Keyword(s):

Large Scale ◽

Assessment Model ◽

Generic Model ◽

Small Subset ◽

Deep Convolutional Neural Networks ◽

Personal Taste ◽

Hinge Loss ◽

Novel Approach ◽

Large Scale Dataset ◽

Image Pairs

AbstractRecent image aesthetic assessment methods have achieved remarkable progress due to the emergence of deep convolutional neural networks (CNNs). However, these methods focus primarily on predicting generally perceived preference of an image, making them usually have limited practicability, since each user may have completely different preferences for the same image. To address this problem, this paper presents a novel approach for predicting personalized image aesthetics that fit an individual user’s personal taste. We achieve this in a coarse to fine manner, by joint regression and learning from pairwise rankings. Specifically, we first collect a small subset of personal images from a user and invite him/her to rank the preference of some randomly sampled image pairs. We then search for the K-nearest neighbors of the personal images within a large-scale dataset labeled with average human aesthetic scores, and use these images as well as the associated scores to train a generic aesthetic assessment model by CNN-based regression. Next, we fine-tune the generic model to accommodate the personal preference by training over the rankings with a pairwise hinge loss. Experiments demonstrate that our method can effectively learn personalized image aesthetic preferences, clearly outperforming state-of-the-art methods. Moreover, we show that the learned personalized image aesthetic benefits a wide variety of applications.

Download Full-text

VIPPrint: Validating Synthetic Image Detection and Source Linking Methods on a Large Scale Dataset of Printed Documents

Journal of Imaging ◽

10.3390/jimaging7030050 ◽

2021 ◽

Vol 7 (3) ◽

pp. 50

Author(s):

Anselmo Ferreira ◽

Ehsan Nowroozi ◽

Mauro Barni

Keyword(s):

Large Scale ◽

State Of The Art ◽

Child Pornography ◽

Forensic Analysis ◽

Synthetic Image ◽

Image Detection ◽

Face Images ◽

Large Scale Dataset ◽

Scanned Images ◽

Analysis Of The Images

The possibility of carrying out a meaningful forensic analysis on printed and scanned images plays a major role in many applications. First of all, printed documents are often associated with criminal activities, such as terrorist plans, child pornography, and even fake packages. Additionally, printing and scanning can be used to hide the traces of image manipulation or the synthetic nature of images, since the artifacts commonly found in manipulated and synthetic images are gone after the images are printed and scanned. A problem hindering research in this area is the lack of large scale reference datasets to be used for algorithm development and benchmarking. Motivated by this issue, we present a new dataset composed of a large number of synthetic and natural printed face images. To highlight the difficulties associated with the analysis of the images of the dataset, we carried out an extensive set of experiments comparing several printer attribution methods. We also verified that state-of-the-art methods to distinguish natural and synthetic face images fail when applied to print and scanned images. We envision that the availability of the new dataset and the preliminary experiments we carried out will motivate and facilitate further research in this area.

Download Full-text

Cerebrospinal fluid metabolomics identifies 19 brain-related phenotype associations

Communications Biology ◽

10.1038/s42003-020-01583-z ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Daniel J. Panyard ◽

Kyeong Mo Kim ◽

Burcu F. Darst ◽

Yuetiva K. Deming ◽

Xiaoyuan Zhong ◽

...

Keyword(s):

Cerebrospinal Fluid ◽

Drug Targets ◽

Large Scale ◽

Prediction Models ◽

Genome Wide Association ◽

Large Samples ◽

Genome Wide ◽

Metabolomic Data ◽

Related Phenotype ◽

Omic Data

AbstractThe study of metabolomics and disease has enabled the discovery of new risk factors, diagnostic markers, and drug targets. For neurological and psychiatric phenotypes, the cerebrospinal fluid (CSF) is of particular importance. However, the CSF metabolome is difficult to study on a large scale due to the relative complexity of the procedure needed to collect the fluid. Here, we present a metabolome-wide association study (MWAS), which uses genetic and metabolomic data to impute metabolites into large samples with genome-wide association summary statistics. We conduct a metabolome-wide, genome-wide association analysis with 338 CSF metabolites, identifying 16 genotype-metabolite associations (metabolite quantitative trait loci, or mQTLs). We then build prediction models for all available CSF metabolites and test for associations with 27 neurological and psychiatric phenotypes, identifying 19 significant CSF metabolite-phenotype associations. Our results demonstrate the feasibility of MWAS to study omic data in scarce sample types.

Download Full-text

ShadingNet: Image Intrinsics by Fine-Grained Shading Decomposition

International Journal of Computer Vision ◽

10.1007/s11263-021-01477-5 ◽

2021 ◽

Author(s):

Anil S. Baslamisli ◽

Partha Das ◽

Hoang-An Le ◽

Sezer Karaoglu ◽

Theo Gevers

Keyword(s):

Neural Network ◽

Large Scale ◽

State Of The Art ◽

Image Decomposition ◽

Natural Environments ◽

Decomposition Algorithms ◽

Ambient Light ◽

Fine Grained ◽

Large Scale Dataset ◽

Direct Illumination

AbstractIn general, intrinsic image decomposition algorithms interpret shading as one unified component including all photometric effects. As shading transitions are generally smoother than reflectance (albedo) changes, these methods may fail in distinguishing strong photometric effects from reflectance variations. Therefore, in this paper, we propose to decompose the shading component into direct (illumination) and indirect shading (ambient light and shadows) subcomponents. The aim is to distinguish strong photometric effects from reflectance variations. An end-to-end deep convolutional neural network (ShadingNet) is proposed that operates in a fine-to-coarse manner with a specialized fusion and refinement unit exploiting the fine-grained shading model. It is designed to learn specific reflectance cues separated from specific photometric effects to analyze the disentanglement capability. A large-scale dataset of scene-level synthetic images of outdoor natural environments is provided with fine-grained intrinsic image ground-truths. Large scale experiments show that our approach using fine-grained shading decompositions outperforms state-of-the-art algorithms utilizing unified shading on NED, MPI Sintel, GTA V, IIW, MIT Intrinsic Images, 3DRMS and SRD datasets.

Download Full-text