Research note: Examining potential bias in large-scale censored data

2021 ◽  
Author(s):  
Jennifer Allen ◽  
Markus Mobius ◽  
David M. Rothschild ◽  
Duncan J. Watts

We examine potential bias in Facebook’s 10-trillion-cell URLs dataset, consisting of URLs shared on its platform and their engagement metrics. Despite the unprecedented size of the dataset, it was altered to protect user privacy in two ways: 1) by adding differentially private noise to engagement counts, and 2) by censoring the data with a 100-public-share threshold for a URL’s inclusion. To understand how these alterations affect conclusions drawn from the data, we estimate the prevalence of fake news in the massive, censored URLs dataset and compare it to an estimate from a smaller, representative dataset. We show that censoring can substantially alter the conclusions drawn from the Facebook dataset. Because of the 100-public-share threshold, descriptive statistics from the Facebook URLs dataset overestimate the share of fake news and news overall by as much as 4X. We conclude with more general implications for censoring data.
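The censoring mechanism described above is easy to illustrate: if fake-news URLs accumulate public shares at a different rate than other URLs, truncating the dataset at 100 public shares skews any prevalence estimate computed from it. A minimal simulation of this effect (all numbers here, including the share-count distributions, are illustrative assumptions, not figures from the paper):

```python
import random

random.seed(0)

# Hypothetical population of URLs: 2% are "fake news". Assume fake-news
# URLs tend to accrue more public shares than other URLs (an illustrative
# assumption standing in for whatever the true engagement pattern is).
urls = []
for _ in range(100_000):
    fake = random.random() < 0.02
    mean_shares = 80 if fake else 20
    shares = random.expovariate(1 / mean_shares)
    urls.append((fake, shares))

def fake_share(pop):
    """Fraction of URLs in the population that are fake news."""
    return sum(f for f, _ in pop) / len(pop)

full = fake_share(urls)
# Apply the 100-public-share inclusion threshold, as in the dataset.
censored_pop = [u for u in urls if u[1] >= 100]
censored = fake_share(censored_pop)
print(f"full: {full:.3f}  censored: {censored:.3f}  ratio: {censored / full:.1f}x")
```

Because the threshold keeps a far larger fraction of the heavier-tailed group, the censored estimate of fake-news prevalence lands well above the true population value, which is the direction of bias the abstract reports.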

2020 ◽  
Vol 29 (3S) ◽  
pp. 638-647 ◽  
Author(s):  
Janine F. J. Meijerink ◽  
Marieke Pronk ◽  
Sophia E. Kramer

Purpose: The SUpport PRogram (SUPR) study was carried out in the context of a private-academic partnership and is the first study to evaluate, on a large scale in a hearing aid dispensing setting, the long-term effects of a communication program (SUPR) for older hearing aid users and their communication partners. The purpose of this research note is to reflect on the lessons learned during the development, implementation, and evaluation phases of the SUPR project.
Procedure: This research note describes the procedures followed during the different phases of the SUPR project and critically discusses the strengths and weaknesses of the approach taken.
Conclusion: This research note may provide researchers and intervention developers with useful insights into how aural rehabilitation interventions such as SUPR can be developed by incorporating the needs of the different stakeholders, evaluated by using a robust research design (including a large sample size and a longer-term follow-up assessment), and implemented widely by collaborating with a private partner (a hearing aid dispensing practice chain).


2020 ◽  
Author(s):  
Richard Rogers

Ushering in the contemporary ‘fake news’ crisis, Craig Silverman of Buzzfeed News reported that fake news outperformed mainstream news on Facebook in the three months prior to the 2016 US presidential elections. Here the report’s methods and findings are revisited for 2020. Examining Facebook user engagement with election-related stories, and applying Silverman’s classification of fake news, it was found that the problem has worsened, implying that the measures undertaken to date have not remedied the issue. If, however, one were to classify ‘fake news’ more strictly, as Facebook and certain media organizations do with the notion of ‘false news’, the scale of the problem shrinks. A smaller-scale problem could imply a greater role for fact-checkers (rather than deferring to mass-scale content moderation), while a larger one could lead to the further politicisation of source adjudication, where broadly labelling particular sources as ‘fake’, ‘problematic’, and/or ‘junk’ results in backlash.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Vedran Sekara ◽  
Laura Alessandretti ◽  
Enys Mones ◽  
Håkan Jonsson

Large-scale collection of human behavioural data by companies raises serious privacy concerns. We show that behaviour captured in the form of application usage data collected from smartphones is highly unique, even in large datasets encompassing millions of individuals. This makes behaviour-based re-identification of users across datasets possible. We study 12 months of data from 3.5 million people in 33 countries and show that although four apps are enough to uniquely re-identify 91.2% of individuals using a simple strategy based on public information, there are considerable seasonal and cultural variations in re-identification rates. We find that people have more unique app-fingerprints during summer months, making them easier to re-identify. Further, we find significant variations in uniqueness across countries, and reveal that American users are the easiest to re-identify, while Finns have the least unique app-fingerprints. We show that differences across countries can largely be explained by two characteristics of the country-specific app-ecosystems: the popularity distribution and the size of app-fingerprints. Our work highlights problems with current policies intended to protect user privacy and emphasizes that policies cannot be ported directly between countries. We anticipate this will nuance the discussion around re-identifiability in digital datasets and improve digital privacy.
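The core re-identification test can be sketched in a few lines: draw users from a skewed app-popularity distribution, take k = 4 apps from a target's fingerprint, and count how many users in the dataset contain that combination. Everything below (the number of apps, the popularity curve, the dataset size) is an illustrative assumption, not the paper's data:

```python
import random

random.seed(1)

APPS = [f"app{i}" for i in range(50)]

def sample_user():
    # Skewed popularity: lower-index apps are installed far more often,
    # a stand-in for the country-specific popularity distributions.
    return frozenset(a for i, a in enumerate(APPS) if random.random() < 1 / (i + 2))

users = [sample_user() for _ in range(1000)]

def unique_fraction(users, k=4):
    """Fraction of users uniquely matched by k apps from their own fingerprint."""
    hits = 0
    for u in users:
        if len(u) < k:
            continue  # fingerprint too small to probe with k apps
        probe = frozenset(random.sample(sorted(u), k))
        matches = sum(1 for v in users if probe <= v)
        if matches == 1:
            hits += 1
    return hits / len(users)

print(f"uniquely re-identified with 4 apps: {unique_fraction(users):.1%}")
```

Varying the popularity exponent or the typical fingerprint size in this toy model moves the re-identification rate, which is the paper's explanation for the cross-country differences.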


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Xiaofeng Wu ◽  
Fangyuan Ren ◽  
Yiming Li ◽  
Zhenwei Chen ◽  
Xiaoling Tao

With the rapid development of Internet of Things (IoT) technology, it has been widely used in various fields. IoT devices, acting as information collection units, can be combined with an information processing and storage unit composed of multiple servers to build an information management system. However, the large amount of sensitive data that IoT devices transmit through the system over real-world wireless networks raises a series of security issues, and the system becomes inefficient when a large number of devices connect concurrently. If each device is authenticated individually, the authentication overhead is huge and the network burden excessive. To address these problems, we propose an efficient authentication protocol for IoT devices in information management systems. In the proposed scheme, aggregated certificateless signcryption is used to complete mutual authentication and encrypted transmission of data, and a cloud server is introduced to ensure service continuity and stability. The scheme is suitable for scenarios in which large-scale IoT terminal devices connect to the information management system simultaneously. It not only reduces the authentication overhead but also ensures user privacy and data integrity. The experimental results and security analysis indicate that the proposed scheme is suitable for information management systems.
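The overhead argument is the key idea: verifying n devices one by one costs n separate checks, whereas an aggregated signature admits a single combined check. The paper's aggregated certificateless signcryption is not reproduced here; as a stand-in, this is a toy Schnorr-style batch verification over a deliberately tiny (and therefore insecure) group, which shows the shape of the aggregation:

```python
import hashlib
import secrets

# Toy Schnorr signatures in the order-Q subgroup of Z_P^*, P = 2Q + 1.
# Parameters are tiny and insecure; this only illustrates aggregation.
P, Q, G = 2039, 1019, 4  # G = 2^2 generates the order-Q subgroup

def h(R: int, msg: bytes) -> int:
    return int.from_bytes(hashlib.sha256(R.to_bytes(2, "big") + msg).digest(), "big") % Q

def keygen():
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)  # private x, public y = G^x mod P

def sign(x: int, msg: bytes):
    k = secrets.randbelow(Q - 1) + 1
    R = pow(G, k, P)
    s = (k + h(R, msg) * x) % Q
    return R, s

def batch_verify(pubs, msgs, sigs) -> bool:
    # One check for n devices: G^(sum s_i) == prod R_i * y_i^e_i (mod P).
    # A production batch verifier must weight each term with a random
    # scalar to block cancellation forgeries; omitted for brevity.
    lhs = pow(G, sum(s for _, s in sigs) % Q, P)
    rhs = 1
    for y, msg, (R, _) in zip(pubs, msgs, sigs):
        rhs = rhs * R % P * pow(y, h(R, msg), P) % P
    return lhs == rhs

keys = [keygen() for _ in range(5)]
msgs = [f"device-{i}".encode() for i in range(5)]
sigs = [sign(x, m) for (x, _), m in zip(keys, msgs)]
print(batch_verify([y for _, y in keys], msgs, sigs))  # True
```

The single modular exponentiation on the left replaces one full verification per device, which is the cost saving the abstract claims for the aggregated scheme.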


2019 ◽  
Vol IV (II) ◽  
pp. 1-6
Author(s):  
Mark Perkins

The huge proliferation of textual (and other) data in digital and organisational sources has led to new techniques of text analysis. The potential thereby unleashed may be underpinned by further theoretical developments of the theory of Discourse Stream Analysis (DSA), as presented here. These include the notion of change in the discourse stream in terms of discourse stream fronts, linguistic elements evolving in real time, and notions of time itself in terms of relative speed, subject orientation, and perception. Big data has also given rise to fake news, the manipulation of messages on a large scale. Fake news is conveyed in fake discourse streams and has led to a new field of description and analysis.


Author(s):  
Dilip Kumar Sharma ◽  
Sonal Garg

Spotting fake news is a critical problem nowadays, and social media are largely responsible for propagating it. Fake news propagated over digital platforms generates confusion and induces biased perspectives in people, so detecting misinformation on these platforms is essential to mitigate its adverse impact. Many approaches have been implemented in recent years, yet despite this productive work, fake news identification still poses many challenges due to the lack of a comprehensive, publicly available benchmark dataset; in particular, there is no large-scale dataset consisting of Indian news only. This paper therefore presents the IFND (Indian Fake News Dataset). The dataset consists of both text and images, and the majority of its content covers events from 2013 to 2021. Content is scraped using the Parsehub tool. To increase the amount of fake news in the dataset, an intelligent augmentation algorithm is used to generate meaningful fake news statements. The latent Dirichlet allocation (LDA) technique is employed for topic modelling to assign categories to news statements. Various machine learning and deep learning classifiers are applied to the text and image modalities to evaluate the proposed IFND dataset, and a multi-modal approach that considers both textual and visual features for fake news detection is also proposed. The proposed IFND dataset achieved satisfactory results. This study affirms that the availability of such a large dataset can stimulate research on this laborious problem and lead to better prediction models.


2019 ◽  
Vol 34 (3) ◽  
pp. 654-669 ◽  
Author(s):  
Michele Di Maio ◽  
Nathan Fiala

During survey data collection, respondents’ answers may be influenced by the behavior and characteristics of the enumerator, the so-called enumerator effect. Using a large-scale experiment in Uganda that randomly pairs enumerators and respondents, the study explores the types of questions for which the enumerator effect may exist. The effect is found to be minimal for many questions but large for political preference questions, where it can account for over 30 percent of the variation in responses. The study then explores which enumerator characteristics, and which combinations of enumerator and respondent characteristics, could account for this effect. Finally, the conclusion provides practical suggestions on how to minimize enumerator effects, and the potential bias they introduce, in various types of data collection.
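The "share of variation" statistic is a between/within variance decomposition: with enumerators randomly assigned, the fraction of total response variance attributable to enumerator identity is the between-enumerator sum of squares over the total sum of squares. A simulated sketch (the enumerator effect sizes and noise levels are assumed for illustration, not taken from the study):

```python
import random

random.seed(4)

# Each enumerator shifts answers to a sensitive question by a fixed
# amount (their "effect"); respondents add individual noise on top.
enumerators = {e: random.gauss(0, 1.0) for e in range(20)}
data = []  # (enumerator id, response)
for e, effect in enumerators.items():
    for _ in range(50):  # 50 randomly assigned respondents each
        data.append((e, effect + random.gauss(0, 1.5)))

grand = sum(y for _, y in data) / len(data)
ss_total = sum((y - grand) ** 2 for _, y in data)
ss_between = 0.0
for e in enumerators:
    ys = [y for ee, y in data if ee == e]
    mean_e = sum(ys) / len(ys)
    ss_between += len(ys) * (mean_e - grand) ** 2
print(f"share of variance from enumerators: {ss_between / ss_total:.0%}")
```

Shrinking the assumed enumerator effect toward zero drives this share toward zero, which is the pattern the study reports for non-political questions.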

