From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme

Author(s):  
Ellen Poplavska ◽  
Thomas B. Norton ◽  
Shomir Wilson ◽  
Norman Sadeh

The European Union’s General Data Protection Regulation (GDPR) has compelled businesses and other organizations to update their privacy policies to state specific information about their data practices. Simultaneously, researchers in natural language processing (NLP) have developed corpora and annotation schemes for extracting salient information from privacy policies, often independently of specific laws. To connect existing NLP research on privacy policies with the GDPR, we introduce a mapping from GDPR provisions to the OPP-115 annotation scheme, which serves as the basis for a growing number of projects to automatically classify privacy policy text. We show that assumptions made in the annotation scheme about the essential topics for a privacy policy reflect many of the same topics that the GDPR requires in these documents. This suggests that OPP-115 continues to be representative of the anatomy of a legally compliant privacy policy, and that the legal assumptions behind it represent the elements of data processing that ought to be disclosed within a policy for transparency. The correspondences we show between OPP-115 and the GDPR suggest the feasibility of bridging existing computational and legal research on privacy policies, benefiting both areas.
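The mapping described above pairs GDPR transparency provisions with OPP-115 annotation categories. A minimal sketch of what such a mapping might look like follows; only the OPP-115 category names are taken from the published scheme, while the specific provision-to-category pairings shown are hypothetical examples, not the authors' actual mapping:

```python
# Illustrative sketch: pairing selected GDPR disclosure provisions with
# OPP-115 annotation categories. The pairings are hypothetical examples.
GDPR_TO_OPP115 = {
    "Art. 13(1)(c) purposes of processing": "First Party Collection/Use",
    "Art. 13(1)(e) recipients of personal data": "Third Party Sharing/Collection",
    "Art. 13(2)(a) storage period": "Data Retention",
    "Art. 13(2)(b) right of access and erasure": "User Access, Edit and Deletion",
    "Art. 32 security of processing": "Data Security",
}

def categories_for(provisions):
    """Return the OPP-115 categories touched by a set of GDPR provisions."""
    return sorted({GDPR_TO_OPP115[p] for p in provisions if p in GDPR_TO_OPP115})

print(categories_for(["Art. 13(2)(a) storage period",
                      "Art. 32 security of processing"]))
# → ['Data Retention', 'Data Security']
```

A mapping of this shape lets a classifier trained on OPP-115 labels be queried in GDPR terms, which is the bridge between computational and legal research the abstract describes.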

2020 ◽  
Vol 2020 (1) ◽  
pp. 47-64 ◽  
Author(s):  
Thomas Linden ◽  
Rishabh Khandelwal ◽  
Hamza Harkous ◽  
Kassem Fawaz

Abstract The EU General Data Protection Regulation (GDPR) is one of the most demanding and comprehensive privacy regulations of all time. A year after it went into effect, we study its impact on the landscape of privacy policies online. We conduct the first longitudinal, in-depth, and at-scale assessment of privacy policies before and after the GDPR. We gauge the complete consumption cycle of these policies, from the first user impressions to the compliance assessment. We create a diverse corpus of two sets of 6,278 unique English-language privacy policies from inside and outside the EU, covering their pre-GDPR and post-GDPR versions. The results of our tests and analyses suggest that the GDPR has been a catalyst for a major overhaul of privacy policies inside and outside the EU. This overhaul of the policies, manifesting in extensive textual changes, especially for EU-based websites, brings mixed benefits to users. While the privacy policies have become considerably longer, our user study with 470 participants on Amazon MTurk indicates a significant improvement, from the users' perspective, in the visual representation of privacy policies for EU websites. We further develop a new workflow for the automated assessment of requirements in privacy policies. Using this workflow, we show that privacy policies cover more data practices and are more consistent with seven compliance requirements after the GDPR. We also assess how transparent organizations are about their privacy practices by performing a specificity analysis. In this analysis, we find evidence of positive changes triggered by the GDPR, with the specificity level improving on average. Still, we find the landscape of privacy policies to be in a transitional phase; many policies still do not meet several key GDPR requirements, or their improved coverage comes with reduced specificity.


Digital ◽  
2021 ◽  
Vol 1 (4) ◽  
pp. 198-215
Author(s):  
Dhiren A. Audich ◽  
Rozita Dara ◽  
Blair Nonnecke

Privacy policies play an important part in informing users about their privacy concerns by operating as memorandums of understanding (MOUs) between them and online service providers. Research suggests that these policies are infrequently read because they are often lengthy, written in jargon, and incomplete, making them difficult for most users to understand. Users are more likely to read short excerpts of privacy policies if they pertain directly to their concerns. In this paper, a novel approach and a proof-of-concept tool are proposed that reduce the amount of privacy policy text a user has to read. The tool uses a domain ontology and natural language processing (NLP) to identify the key areas of a policy that users should read to address their concerns and take appropriate action. Using the ontology to locate key parts of privacy policies, average reading times were substantially reduced from 29–32 min to 45 s.
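The core idea of ontology-guided skimming can be sketched with a toy stand-in: map a user concern to the policy paragraphs that mention related terms, so the user reads only those excerpts. The concern names and term sets below are invented for illustration; the paper's actual domain ontology is far richer than a keyword table.

```python
# Toy sketch of ontology-guided policy skimming. The "ontology" here is a
# hypothetical concern-to-terms table, not the paper's actual domain ontology.
CONCERN_TERMS = {
    "data sharing": {"share", "third party", "partner", "disclose"},
    "retention":    {"retain", "retention", "store", "delete"},
}

def relevant_excerpts(policy_paragraphs, concern):
    """Return only the paragraphs a user with this concern needs to read."""
    terms = CONCERN_TERMS[concern]
    return [p for p in policy_paragraphs
            if any(t in p.lower() for t in terms)]

policy = [
    "We collect your email address to create an account.",
    "We may share your data with third party advertisers.",
    "Data is retained for 12 months and then deleted.",
]
print(relevant_excerpts(policy, "data sharing"))
# → ['We may share your data with third party advertisers.']
```

Reducing a multi-page policy to one or two concern-relevant paragraphs is what drives the drop in reading time the abstract reports.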


10.29007/pc58 ◽  
2018 ◽  
Author(s):  
Julia Lavid ◽  
Marta Carretero ◽  
Juan Rafael Zamorano

In this paper we set forth an annotation model for dynamic modality in English and Spanish, given its relevance not only for contrastive linguistic purposes but also for practical annotation tasks in the Natural Language Processing (NLP) community. An annotation scheme is proposed which captures both the functional-semantic meanings and the language-specific realisations of dynamic meanings in both languages. The scheme is validated through a reliability study performed on a randomly selected set of one hundred and twenty sentences from the MULTINOT corpus, resulting in a high degree of inter-annotator agreement. We discuss our main findings and give particular attention to the difficult cases, as they are currently being used to develop detailed guidelines for the large-scale annotation of dynamic modality in English and Spanish.


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Jennifer D’Souza ◽  
Sören Auer

Abstract Purpose This work aims to normalize the NlpContributions scheme (henceforth NlpContributionGraph) to structure, directly from article sentences, the contributions information in Natural Language Processing (NLP) scholarly articles via a two-stage annotation methodology: 1) a pilot stage to define the scheme (described in prior work); and 2) an adjudication stage to normalize the graphing model (the focus of this paper). Design/methodology/approach We re-annotate, a second time, the contributions-pertinent information across 50 previously annotated NLP scholarly articles in terms of a data pipeline comprising contribution-centered sentences, phrases, and triple statements. Specifically, care was taken in the adjudication annotation stage to reduce annotation noise while formulating the guidelines for our proposed novel NLP contributions structuring and graphing scheme. Findings The application of NlpContributionGraph to the 50 articles ultimately resulted in a dataset of 900 contribution-focused sentences, 4,702 contribution-information-centered phrases, and 2,980 surface-structured triples. The intra-annotator agreement between the first and second stages, in terms of F1 score, was 67.92% for sentences, 41.82% for phrases, and 22.31% for triple statements, indicating that annotation decision variance grows with the granularity of the information. Research limitations NlpContributionGraph has limited scope for structuring scholarly contributions compared with STEM (Science, Technology, Engineering, and Medicine) scholarly knowledge at large. Further, the annotation scheme in this work is designed by only an intra-annotator consensus: a single annotator first annotated the data to propose the initial scheme, following which the same annotator re-annotated the data to normalize the annotations in an adjudication stage.
However, the eventual goal of this work is to achieve a standardized retrospective model for capturing NLP contributions from scholarly articles. This would entail a larger initiative enlisting multiple annotators to accommodate different worldviews into a "single" set of structures and relationships as the final scheme. Given that this is the first proposal of the scheme, and considering the complexity of the annotation task within a realistic timeframe, our intra-annotator procedure is well suited. Nevertheless, the model proposed in this work is presently limited in that it does not incorporate multiple annotator worldviews; addressing this is planned as future work to produce a robust model. Practical implications We demonstrate NlpContributionGraph data integrated into the Open Research Knowledge Graph (ORKG), a next-generation KG-based digital library with intelligent computations enabled over structured scholarly knowledge, as a viable aid to assist researchers in their day-to-day tasks. Originality/value NlpContributionGraph is a novel scheme to annotate research contributions from NLP articles and integrate them into a knowledge graph, which to the best of our knowledge does not exist in the community. Furthermore, our quantitative evaluations of the two-stage annotation tasks offer insights into task difficulty.
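The sentence → phrase → triple pipeline described above can be sketched with an invented example; the sentence, phrases, and triples below are illustrative, not drawn from the annotated corpus:

```python
from typing import NamedTuple

# Hedged sketch of the NlpContributionGraph data pipeline: a contribution-focused
# sentence is reduced to contribution-information-centered phrases, which are
# then assembled into (subject, predicate, object) triple statements.
class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

sentence = "Our model achieves 92.4 F1 on the benchmark dataset."
phrases = ["Our model", "achieves", "92.4 F1", "on", "benchmark dataset"]
triples = [
    Triple("Our model", "achieves", "92.4 F1"),
    Triple("92.4 F1", "on", "benchmark dataset"),
]

# Granularity grows along the pipeline: one sentence yields several phrases
# and multiple triples, which is where annotation agreement drops fastest.
print(len(phrases), len(triples))
# → 5 2
```

The widening fan-out from sentences to triples mirrors the falling agreement scores (67.92% → 41.82% → 22.31%) the abstract reports: the finer the unit, the more ways there are to annotate it.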


2021 ◽  
Author(s):  
Nathan Ji ◽  
Yu Sun

The digital age gives us access to a multitude of both information and mediums through which we can interpret it. Much of the time, people find interpreting such information difficult because the medium may not be as user-friendly as possible. This project examined how one can identify specific information in a given text based on a question, with the aim of streamlining one's ability to determine the relevance of a given text relative to their objective. The project achieved an overall 80% success rate across 10 articles with three questions asked per article. This success rate indicates that the approach is likely applicable to those asking content-level questions about an article.


2019 ◽  
Author(s):  
Jaime Benjumea ◽  
Jorge Ropero ◽  
Octavio Rivera-Romero ◽  
Enrique Dorronzoro-Zubiete ◽  
Alejandro Carrasco

BACKGROUND Cancer patients are increasingly using mobile health (mHealth) apps to take control of their health. Many studies have explored their efficiency, content, usability, and adherence; however, these apps have created a new set of privacy challenges, as they store personal and sensitive data. OBJECTIVE The purpose of this study was to refine and evaluate a scale based on the General Data Protection Regulation and assess the fairness of privacy policies of mHealth apps. METHODS Based on the experience gained from our previous work, we redefined some of the items and scores of our privacy scale. Using the new version of our scale, we conducted a case study in which we analyzed the privacy policies of cancer Android apps. A systematic search of cancer mobile apps was performed in the Spanish version of the Google Play website. RESULTS The redefinition of certain items reduced discrepancies between reviewers. Thus, use of the scale was made easier, not only for the reviewers but also for any other potential users of our scale. Assessment of the privacy policies revealed that 29% (9/31) of the apps included in the study did not have a privacy policy, 32% (10/31) had a score over 50 out of a maximum of 100 points, and 39% (12/31) scored fewer than 50 points. CONCLUSIONS In this paper, we present a scale for the assessment of mHealth apps that is an improved version of our previous scale with adjusted scores. The results showed a lack of fairness in the mHealth app privacy policies that we examined, and the scale provides developers with a tool to evaluate their privacy policies.
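The percentages in the results above follow directly from the reported counts over the 31 apps in the study; a quick check:

```python
# Reported breakdown of the 31 cancer Android apps on the paper's 0-100
# privacy scale: no privacy policy, score over 50, score under 50.
counts = {"no privacy policy": 9, "score over 50": 10, "score under 50": 12}
total = sum(counts.values())  # 31 apps assessed in the case study
percentages = {k: round(100 * v / total) for k, v in counts.items()}
print(percentages)
# → {'no privacy policy': 29, 'score over 50': 32, 'score under 50': 39}
```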


2021 ◽  
Vol 2021 (4) ◽  
pp. 480-499
Author(s):  
Henry Hosseini ◽  
Martin Degeling ◽  
Christine Utz ◽  
Thomas Hupperich

Abstract Privacy policies have become a focal point of privacy research. With their goal to reflect the privacy practices of a website, service, or app, they are often the starting point for researchers who analyze the accuracy of claimed data practices, user understanding of practices, or control mechanisms for users. Due to vast differences in structure, presentation, and content, it is often challenging to extract privacy policies from online resources like websites for analysis. In the past, researchers have relied on scrapers tailored to the specific analysis or task, which complicates comparing results across different studies. To unify future research in this field, we developed a toolchain to process website privacy policies and prepare them for research purposes. The core part of this chain is a detector module for English and German, using natural language processing and machine learning to automatically determine whether given texts are privacy or cookie policies. We leverage multiple existing data sets to refine our approach, evaluate it on a recently published longitudinal corpus, and show that it contains a number of misclassified documents. We believe that unifying data preparation for the analysis of privacy policies can help make different studies more comparable and is a step towards more thorough analyses. In addition, we provide insights into common pitfalls that may lead to invalid analyses.
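The detector module at the core of the toolchain uses NLP and trained machine-learning models; as a stand-in, the stdlib-only toy below illustrates only the interface of that classification step (text in, label out). The cue terms are hypothetical and not taken from the paper:

```python
# Toy stand-in for a privacy/cookie-policy detector. The real detector is a
# trained ML model for English and German; this sketch just counts cue terms.
PRIVACY_CUES = {"personal data", "privacy policy", "data controller", "gdpr"}
COOKIE_CUES = {"cookie", "cookies", "tracking technologies"}

def detect(text):
    """Classify a document as a privacy policy, cookie policy, or other."""
    t = text.lower()
    p = sum(cue in t for cue in PRIVACY_CUES)
    c = sum(cue in t for cue in COOKIE_CUES)
    if p == 0 and c == 0:
        return "other"
    return "cookie policy" if c > p else "privacy policy"

print(detect("This privacy policy explains how we process personal data."))
# → privacy policy
```

Running such a detector over scraped pages is what filters misclassified documents (navigation pages, terms of service) out of a policy corpus before analysis.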


Author(s):  
Georgia M. Kapitsaki

Privacy protection plays a vital role in pervasive and web environments, where users contact applications and services that may require access to their sensitive data. Current legislation, such as the European General Data Protection Regulation, puts more emphasis on user protection and on placing users at the center of privacy choices. SOAP (simple object access protocol)-based and RESTful services may require access to sensitive data for their proper functioning, but users should be able to express their preferences about what should and should not be accessed. In this chapter, these issues are discussed and a solution, tailored to SOAP-based services, is presented for reconciling user preferences expressed in privacy policies with service data needs. A usage example is provided, and the main open issues providing directions for future research are discussed.


BMJ Open ◽  
2019 ◽  
Vol 9 (4) ◽  
pp. e023232 ◽  
Author(s):  
Beata Fonferko-Shadrach ◽  
Arron S Lacey ◽  
Angus Roberts ◽  
Ashley Akbari ◽  
Simon Thompson ◽  
...  

Objective Routinely collected healthcare data are a powerful research resource but often lack detailed disease-specific information that is collected in clinical free text, for example, clinic letters. We aim to use natural language processing techniques to extract detailed clinical information from epilepsy clinic letters to enrich routinely collected data. Design We used the general architecture for text engineering (GATE) framework to build an information extraction system, ExECT (extraction of epilepsy clinical text), combining rule-based and statistical techniques. We extracted nine categories of epilepsy information in addition to clinic date and date of birth across 200 clinic letters. We compared the results of our algorithm with a manual review of the letters by an epilepsy clinician. Setting De-identified and pseudonymised epilepsy clinic letters from a Health Board serving half a million residents in Wales, UK. Results We identified 1925 items of information with overall precision, recall and F1 score of 91.4%, 81.4% and 86.1%, respectively. Precision and recall for epilepsy-specific categories were: epilepsy diagnosis (88.1%, 89.0%), epilepsy type (89.8%, 79.8%), focal seizures (96.2%, 69.7%), generalised seizures (88.8%, 52.3%), seizure frequency (86.3%, 53.6%), medication (96.1%, 94.0%), CT (55.6%, 58.8%), MRI (82.4%, 68.8%) and electroencephalogram (81.5%, 75.3%). Conclusions We have built an automated clinical text extraction system that can accurately extract epilepsy information from free text in clinic letters. This can enhance routinely collected data for research in the UK. The information extracted with ExECT, such as epilepsy type, seizure frequency and neurological investigations, is often missing from routinely collected data. We propose that our algorithm can bridge this data gap, enabling further epilepsy research opportunities. While many of the rules in our pipeline were tailored to extract epilepsy-specific information, our methods can be applied to other diseases and can also be used in clinical practice to record patient information in a structured manner.
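The overall F1 score reported above is the harmonic mean of the stated precision and recall, which can be checked in a few lines:

```python
# F1 is the harmonic mean of precision and recall; the reported overall
# figures for ExECT (precision 91.4%, recall 81.4%) yield the stated 86.1%.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(91.4, 81.4), 1))
# → 86.1
```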

