What You Can Scrape and What Is Right to Scrape: A Proposal for a Tool to Collect Public Facebook Data

2020 ◽  
Vol 6 (3) ◽  
pp. 205630512094070 ◽  
Author(s):  
Moreno Mancosu ◽  
Federico Vegetti

In reaction to the Cambridge Analytica scandal, Facebook has restricted access to its Application Programming Interface (API). This new policy has hampered the ability of independent researchers to study relevant topics in political and social behavior. Yet much of the public information that researchers may be interested in is still available on Facebook, and can still be systematically collected through web scraping techniques. The goal of this article is twofold. First, we discuss some ethical and legal issues that researchers should consider as they plan their collection and possible publication of Facebook data. In particular, we discuss what kind of information can be ethically gathered about users (public information), what published data should look like to comply with privacy regulations (such as the GDPR), and what consequences violating Facebook’s terms of service may entail for the researcher. Second, we present a scraping routine for public Facebook posts and discuss some technical adjustments that can be performed for the data to be ethically and legally acceptable. The code employs screen scraping to collect the list of reactions to a Facebook public post and performs a one-way cryptographic hash function on the users’ identifiers to pseudonymize their personal information while still keeping them traceable within the data. This article contributes to the debate around freedom of internet research and the ethical concerns that might arise from scraping data on the social web.
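The pseudonymization step described above can be illustrated with a minimal sketch, assuming user identifiers have already been scraped; the salt value and data layout below are hypothetical, not the article's actual routine.

```python
# Minimal sketch of one-way hashing of scraped user identifiers.
# SALT and the record layout are illustrative assumptions.
import hashlib

SALT = "project-specific-secret"  # keep secret; never publish with the data

def pseudonymize(user_id: str) -> str:
    """SHA-256 digest of a salted user identifier.

    The same input always maps to the same digest, so users stay
    traceable within the dataset, but the original ID cannot be
    recovered from the published data.
    """
    return hashlib.sha256((SALT + user_id).encode("utf-8")).hexdigest()

reactions = [
    {"user_id": "100001234567890", "reaction": "LIKE"},
    {"user_id": "100009876543210", "reaction": "ANGRY"},
]
anonymized = [
    {"user": pseudonymize(r["user_id"]), "reaction": r["reaction"]}
    for r in reactions
]
print(anonymized)
```

Because the hash is deterministic, repeated reactions by the same user collapse to the same pseudonym, which is what keeps users "traceable within the data" as the abstract describes.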

2017 ◽  
Vol 36 (2) ◽  
pp. 195-211 ◽  
Author(s):  
Patrick Rafail

Twitter data are widely used in the social sciences. The Twitter Application Programming Interface (API) allows researchers to build large databases of user activity efficiently. Despite the potential of Twitter as a data source, less attention has been paid to issues of sampling, and in particular, the implications of different sampling strategies on overall data quality. This research proposes a set of conceptual distinctions between four types of populations that emerge when analyzing Twitter data and suggests sampling strategies that facilitate more comprehensive data collection from the Twitter API. Using three applications drawn from large databases of Twitter activity, this research also compares the results from the proposed sampling strategies, which provide defensible representations of the population of activity, to those collected with more frequently used hashtag samples. The results suggest that hashtag samples misrepresent important aspects of Twitter activity and may lead researchers to erroneous conclusions.
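The sampling distinction at the heart of the article can be sketched against the Twitter API. The example below is a hedged illustration using the current v2 recent-search and user-timeline endpoints (not the API version available to the 2017 study); the bearer token is a placeholder, and this is not the study's collection code.

```python
# Sketch: a hashtag sample vs. a user-timeline sample on the Twitter API v2.
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder
HEADERS = {"Authorization": f"Bearer {BEARER_TOKEN}"}

def hashtag_sample(tag: str) -> list:
    """Recent-search query keyed on a hashtag: efficient, but it misses
    related activity by the same users that does not carry the hashtag."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers=HEADERS,
        params={"query": f"#{tag}", "max_results": 100},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def user_timeline(user_id: str) -> list:
    """Recent timeline of one user, recovering the surrounding activity
    that a pure hashtag sample would drop."""
    resp = requests.get(
        f"https://api.twitter.com/2/users/{user_id}/tweets",
        headers=HEADERS,
        params={"max_results": 100},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

# Typical two-stage strategy: seed with a hashtag sample, then expand to
# the full timelines of the users it surfaces.
```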


2020 ◽  
Vol 12 (10) ◽  
pp. 4200 ◽  
Author(s):  
Thanh-Long Giang ◽  
Dinh-Tri Vo ◽  
Quan-Hoang Vuong

Using data from the WHO’s Situation Reports on the COVID-19 pandemic from 21 January 2020 to 30 March 2020, along with other health, demographic, and macroeconomic indicators from the WHO’s Application Programming Interface and the World Bank’s Development Indicators, this paper explores the death rates of infected persons and their possible associated factors. Through panel analysis, we found consistent results that healthcare system conditions, particularly the numbers of hospital beds and medical staff, have played extremely important roles in reducing the death rates of COVID-19 infected persons. In addition, both the mortality rates due to different non-communicable diseases (NCDs) and the share of people aged 65 and over were significantly related to the death rates. We also found that controlling international and domestic air travel, along with increasingly popular anti-COVID-19 measures (i.e., quarantine and social distancing), would help reduce the death rates in all countries. We conducted tests for robustness and found that the Driscoll and Kraay (1998) method was the most suitable estimator for a finite sample, which helped confirm the robustness of our estimations. Based on the findings, we suggest that the preparedness of healthcare systems for aged populations needs more attention from the public and politicians, regardless of income level, when facing COVID-19-like pandemics.
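As a rough illustration of the estimation strategy, the sketch below fits a panel model with Driscoll-Kraay standard errors using the Python linearmodels package, where the kernel covariance estimator implements the Driscoll and Kraay (1998) correction; the input file and column names are hypothetical, not the paper's actual variables.

```python
# Sketch: panel regression with Driscoll-Kraay (1998) standard errors.
# "covid_panel.csv" and all column names are hypothetical placeholders.
import pandas as pd
from linearmodels.panel import PanelOLS

df = pd.read_csv("covid_panel.csv")        # hypothetical input file
df = df.set_index(["country", "date"])     # (entity, time) MultiIndex

model = PanelOLS.from_formula(
    "death_rate ~ 1 + hospital_beds + medical_staff"
    " + ncd_mortality + share_65plus",
    data=df,
)
# In linearmodels, cov_type="kernel" requests the Driscoll-Kraay HAC
# covariance estimator, robust to cross-sectional dependence.
res = model.fit(cov_type="kernel")
print(res.summary)
```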


2019 ◽  
Vol 15 (3) ◽  
pp. 76-95 ◽  
Author(s):  
Joshua Ofoeda ◽  
Richard Boateng ◽  
John Effah

The purpose of this study is to synthesize API research. The study takes stock of the literature on APIs from academic journals, along with the associated themes, frameworks, methodologies, publication outlets, and levels of analysis. The authors draw on a total of 104 articles from academic journals and conferences published from 2010 to 2018. A systematic literature review was conducted on the selected articles. The findings suggest that API research is primarily atheoretical and largely focused on technological dimensions such as design and usage, thus neglecting social issues such as the business and managerial applications of APIs, which are equally important. Future research directions are provided concerning the identified gaps.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Casper W. Andersen ◽  
Rickard Armiento ◽  
Evgeny Blokhin ◽  
Gareth J. Conduit ◽  
Shyam Dwaraknath ◽  
...  

Abstract: The Open Databases Integration for Materials Design (OPTIMADE) consortium has designed a universal application programming interface (API) to make materials databases accessible and interoperable. We outline the first stable release of the specification, v1.0, which is already supported by many leading databases and several software packages. We illustrate the advantages of the OPTIMADE API through worked examples on each of the public materials databases that support the full API specification.
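A worked query of the kind the paper describes can be sketched in a few lines. The base URL below assumes the Materials Project's public OPTIMADE endpoint; the filter grammar (`elements HAS ALL ...`, `nelements=...`) comes from the v1 specification.

```python
# Sketch: query binary Al-O structures through an OPTIMADE v1 endpoint.
import requests

BASE = "https://optimade.materialsproject.org/v1"  # assumed public endpoint
params = {
    "filter": 'elements HAS ALL "Al","O" AND nelements=2',
    "page_limit": 5,
}
resp = requests.get(f"{BASE}/structures", params=params)
resp.raise_for_status()
for entry in resp.json()["data"]:
    attrs = entry["attributes"]
    print(entry["id"], attrs.get("chemical_formula_reduced"))
```

Because the API is universal, pointing `BASE` at any other provider that supports the full specification should return results in the same JSON structure.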


2021 ◽  
Vol 11 (1) ◽  
pp. 20
Author(s):  
Mete Ercan Pakdil ◽  
Rahmi Nurhan Çelik

Geospatial data and related technologies have become an increasingly important part of data analysis processes, playing a prominent role in most of them. The serverless paradigm has become one of the most popular and frequently used technologies within cloud computing. This paper reviews the serverless paradigm and examines how it can be leveraged for geospatial data processes using open standards from the geospatial community. We propose a system design and architecture to handle complex geospatial data processing jobs with minimum human intervention and resource consumption using serverless technologies. To define and execute workflows in the system, we also propose new models for workflow and task definitions. Moreover, the proposed system exposes web services based on the Open Geospatial Consortium (OGC) API Processes specification to provide interoperability with other geospatial applications, with the anticipation that the specification will be more commonly used in the future. We implemented the proposed system on one of the public cloud providers as a proof of concept and evaluated it with sample geospatial workflows and against cloud architecture best practices.
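The OGC API Processes interaction the system builds on can be sketched as follows; the server URL and process identifier are hypothetical, while the `POST /processes/{id}/execution` pattern and the `inputs` document follow OGC API - Processes - Part 1: Core.

```python
# Sketch: execute a (hypothetical) buffer process via OGC API - Processes.
import requests

SERVER = "https://example.com/geoapi"   # hypothetical deployment
PROCESS_ID = "buffer"                   # hypothetical process identifier

body = {
    "inputs": {
        "geometry": {"type": "Point", "coordinates": [28.97, 41.01]},
        "distance": 1000,
    }
}
resp = requests.post(
    f"{SERVER}/processes/{PROCESS_ID}/execution",
    json=body,
    headers={"Prefer": "respond-async"},  # ask for asynchronous execution
)
resp.raise_for_status()
# For async execution the server returns a job status URL in Location.
print(resp.status_code, resp.headers.get("Location"))
```

The asynchronous job pattern matters here: long-running geospatial workflows map naturally onto serverless functions that poll or are triggered by job-state changes.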


2020 ◽  
Author(s):  
Shubh Mohan Singh ◽  
Chaitanya Reddy

Abstract: Objectives: A majority of patients suffering from acute COVID-19 are expected to recover symptomatically and functionally. However, there are reports that some people continue to experience symptoms beyond the stage of acute infection, a phenomenon that has been called long COVID. Study design: This study analysed symptoms reported by Twitter users self-identifying as having long COVID. Methods: The search was carried out using the Twitter public streaming application programming interface with a relevant search term. Results: We identified 89 users with usable data in their tweets. A majority of users described multiple symptoms, the most common of which were fatigue, shortness of breath, pain, and brain fog/concentration difficulties. The most common course of symptoms was episodic. Conclusions: Given the public health importance of this issue, the study suggests that there is a need to better study post-acute COVID symptoms.
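The symptom-tallying step can be illustrated with a minimal sketch; the tweets and the symptom lexicon below are invented stand-ins, assuming tweets have already been collected from the streaming API, and are not the study's data or code.

```python
# Sketch: count symptom mentions in already-collected tweet texts.
from collections import Counter

# Illustrative lexicon based on the symptoms named in the abstract.
SYMPTOMS = ["fatigue", "shortness of breath", "pain", "brain fog"]

tweets = [  # invented examples
    "Week 12 of #longcovid: fatigue and brain fog are still here.",
    "Still have shortness of breath months after infection #longcovid",
]

counts = Counter()
for text in tweets:
    lowered = text.lower()
    for symptom in SYMPTOMS:
        if symptom in lowered:
            counts[symptom] += 1

print(counts.most_common())
```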


2019 ◽  
Vol 35 (20) ◽  
pp. 4147-4155 ◽  
Author(s):  
Peter Selby ◽  
Rafael Abbeloos ◽  
Jan Erik Backlund ◽  
Martin Basterrechea Salido ◽  
Guillaume Bauchet ◽  
...  

Abstract: Motivation: Modern genomic breeding methods rely heavily on very large amounts of phenotyping and genotyping data, presenting new challenges in effective data management and integration. Recently, the size and complexity of datasets have increased significantly, with the result that data are often stored on multiple systems. As analyses of interest increasingly require aggregation of datasets from diverse sources, data exchange between disparate systems becomes a challenge. Results: To facilitate interoperability among breeding applications, we present the public plant Breeding Application Programming Interface (BrAPI). BrAPI is a standardized web service API specification. The development of BrAPI is a collaborative, community-based initiative involving a growing global community of over a hundred participants representing several dozen institutions and companies. Development of such a standard is recognized as a foundational technology critical to a number of important large breeding system initiatives. The focus of the first version of the API is on providing services for connecting systems and retrieving basic breeding data, including germplasm, study, observation, and marker data. A number of BrAPI-enabled applications, termed BrAPPs, have been written that take advantage of the emerging support for BrAPI by many databases. Availability and implementation: More information on BrAPI, including links to the specification, test suites, BrAPPs, and sample implementations, is available at https://brapi.org/. The BrAPI specification and the developer tools are provided as free and open source.
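A minimal BrAPI client call might look like the sketch below; the host is assumed to be the community test server linked from brapi.org (not a server named in the abstract), and the `/brapi/v1/germplasm` path follows the first version of the specification that the paper describes.

```python
# Sketch: retrieve a page of germplasm records over BrAPI v1.
import requests

BASE = "https://test-server.brapi.org/brapi/v1"  # assumed test server
resp = requests.get(f"{BASE}/germplasm", params={"pageSize": 5})
resp.raise_for_status()

# BrAPI responses wrap records in a metadata/result envelope.
payload = resp.json()
for germ in payload["result"]["data"]:
    print(germ.get("germplasmDbId"), germ.get("germplasmName"))
```

The value of the standard is that the same client code should work against any BrAPI-compliant breeding database simply by changing `BASE`.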


2019 ◽  
Vol 5 (2) ◽  
pp. 88-97
Author(s):  
M. Fuadi Aziz Muri ◽  
Hendrik Setyo Utomo ◽  
Rabini Sayyidati

An Application Programming Interface (API) is a set of functions that can be called by other programs. An API works as a link that unites applications across various platforms; openly published interfaces of this kind are commonly known as public APIs. Public APIs have spread widely, but programmers who want to find them must browse through various channels, such as general search engines, repository documentation, or individual web articles. Users do not yet have a system dedicated to collecting public APIs, which makes searching for public API links difficult. These problems can be solved by building a web framework with a search engine interface that provides searches specifically over public APIs, so that users can find them more easily. A web service is an API made to support interaction between two or more different applications over a network, and Representational State Transfer (REST) is one of the architectural styles used to build such services.
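The kind of REST-style search service the paper proposes can be sketched in a few lines of Flask; the route, fields, and in-memory catalogue are hypothetical stand-ins for the actual system.

```python
# Sketch: a tiny REST endpoint for searching a public-API catalogue.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Toy catalogue; a real deployment would query a database instead.
CATALOG = [
    {"name": "OpenWeatherMap", "category": "weather"},
    {"name": "REST Countries", "category": "geography"},
]

@app.route("/apis", methods=["GET"])
def search_apis():
    """Return catalogue entries whose name or category matches ?q=..."""
    q = request.args.get("q", "").lower()
    hits = [
        api for api in CATALOG
        if q in api["name"].lower() or q in api["category"].lower()
    ]
    return jsonify(hits)

if __name__ == "__main__":
    app.run(port=5000)
```

A client would then query, for example, `GET /apis?q=weather` and receive a JSON list of matching public APIs.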


A great deal of vital geographic information about places, including points of interest, locations, and details such as neighborhoods and phone numbers, can be found on the Internet. However, such information is not openly available through legitimate means, and what is available is often unreliable, as it is static and not refreshed frequently enough. In this paper, an effective method for collecting datasets of place names from the results of an internet search engine is demonstrated. The proposed strategy is to use the Google search Application Programming Interface to retrieve web pages related to specific area names and types of places, and then to analyze the resulting web pages to extract the addresses and names of places. Using the data gathered from the Internet, the final result is a compiled dataset of place names. We evaluate our methodology against data gathered from Google Maps Street View by examining business signs found in images. The results show that the proposed procedure efficiently created place datasets on par with Google Maps and outperformed the results of OSM.
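The collection step can be approximated with the Google Custom Search JSON API, used here as a stand-in for the Google search API the paper names (the exact service the authors used is not specified); the key and search-engine ID are placeholders.

```python
# Sketch: fetch search results for "<place type> in <area>" and keep the
# page titles and URLs for a later address/name extraction stage.
import requests

API_KEY = "YOUR_API_KEY"        # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"    # placeholder

def search_places(place_type: str, area: str) -> list:
    """Query the Custom Search JSON API and return title/link pairs."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": f"{place_type} in {area}"},
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    # Keep only what the extraction stage needs: page title and URL.
    return [{"title": i["title"], "link": i["link"]} for i in items]

print(search_places("restaurants", "Hyderabad"))
```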


2020 ◽  
Vol 41 (S1) ◽  
pp. s101-s101
Author(s):  
Nana Li ◽  
Gondy Leroy ◽  
Fariba Donovan ◽  
John Galgiani ◽  
Katherine Ellingson

Background: Twitter is used by officials to distribute public health messages and by the public to post information about ongoing afflictions. Because tweets originate from geographically and socially diverse sources, scholars have used this social media data to analyze the spread of diseases such as flu [Alessio Signorini 2011], asthma [Philip Harber 2019], and mental health disorders [Chandler McClellan 2017]. To our knowledge, no Twitter analysis has been performed for Valley fever. Valley fever is a fungal infection caused by the Coccidioides organism, found mostly in Arizona and California. Objective: We analyzed tweets concerning Valley fever to evaluate content, location, and timing. Methods: We collected tweets using the Twitter search application programming interface using the terms “Valley fever,” “valleyfever,” “cocci,” or “Valleyfever” from August 6 to 16, 2019, and again from October 20 to 29, 2019. In total, 2,117 tweets were retrieved. Tweets not focused on Valley fever were filtered out, including a tweet about “Rift valley fever” and tweets where “valley” and “fever” were separate and not one phrase. We excluded tweets not written in English. In total, 1,533 tweets remained; we grouped them into 3 categories: original tweets, hereafter labeled “normal” (N = 497), retweets (N = 811), and replies (N = 225). We converted all terms to lowercase, removed white space and punctuation, and tokenized the tweets. Informal messaging conventions (eg, hashtag, @user, RT, links) and stop words were removed, and terms were lemmatized. Finally, we analyzed the frequency of tweets by season, state, and co-occurring terms. Results: Tweet frequency was 228.5 per week in summer and 113.4 per week in the fall. Users tweeted from 40 different states; the most common were California (N = 401; 10.1 per 100,000 population), Arizona (N = 216; 30.1 per 100,000 population), New York (N = 49), Florida (N = 21), and Washington, DC (N = 14). Term frequency analysis showed that for normal tweets, the 5 most frequent terms were “awareness,” “Arizona,” “disease,” “California,” and “people.” For retweets, the most common terms were “Gunner” (a dog name), “vet,” “prayer,” “cough,” and “family.” For replies, they were “dog,” “lung,” “vet,” “day,” and “result.” Several symptoms were mentioned: “cough” (normal: 8, retweets: 104, replies: 7), “sick” (normal: 21, retweets: 42, replies: 7), “rash” (normal: 2, retweets: 6, replies: 1), and “headache” (normal: 1, retweets: 3, replies: 0). Conclusions: Valley fever tweets are potentially sufficient to track disease intensity, especially in Arizona and California. Data collection over longer intervals is needed to understand the utility of Twitter in this context. Disclosures: None. Funding: None.
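The preprocessing pipeline in Methods (lowercasing, stripping messaging conventions and punctuation, tokenizing, removing stop words, lemmatizing) can be sketched with NLTK as follows; the example tweet is invented, and simple whitespace splitting stands in for the study's unspecified tokenizer.

```python
# Sketch: tweet preprocessing pipeline per the Methods description.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(tweet: str) -> list:
    """Lowercase, strip conventions and punctuation, tokenize,
    remove stop words, and lemmatize."""
    text = tweet.lower()
    # Drop RT markers, @user mentions, hashtags, and links.
    text = re.sub(r"\brt\b|@\w+|#\w+|https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits
    tokens = text.split()
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stops]

print(preprocess("RT @user: Valley fever awareness in Arizona! https://t.co/x"))
# -> ['valley', 'fever', 'awareness', 'arizona']
```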

