Unlocking digital archives: cross-disciplinary perspectives on AI and born-digital data

AI & Society ◽

10.1007/s00146-021-01367-x ◽

2022 ◽

Author(s):

Lise Jaillant ◽

Annalina Caputo

Keyword(s):

Partial Information ◽

Machine Learning Algorithms ◽

Digital Data ◽

British Library ◽

Web Pages ◽

Sensitive Information ◽

Digital Archives ◽

Intelligent Support ◽

Cultural Organizations ◽

Technical Issues

AbstractCo-authored by a Computer Scientist and a Digital Humanist, this article examines the challenges faced by cultural heritage institutions in the digital age, which have led to the closure of the vast majority of born-digital archival collections. It focuses particularly on cultural organizations such as libraries, museums and archives, used by historians, literary scholars and other Humanities scholars. Most born-digital records held by cultural organizations are inaccessible due to privacy, copyright, commercial and technical issues. Even when born-digital data are publicly available (as in the case of web archives), users often need to physically travel to repositories such as the British Library or the Bibliothèque Nationale de France to consult web pages. Provided with enough sample data from which to learn and train their models, AI, and more specifically machine learning algorithms, offer the opportunity to improve and ease the access to digital archives by learning to perform complex human tasks. These vary from providing intelligent support for searching the archives to automate tedious and time-consuming tasks. In this article, we focus on sensitivity review as a practical solution to unlock digital archives that would allow archival institutions to make non-sensitive information available. This promise to make archives more accessible does not come free of warnings for potential pitfalls and risks: inherent errors, "black box" approaches that make the algorithm inscrutable, and risks related to bias, fake, or partial information. Our central argument is that AI can deliver its promise to make digital archival collections more accessible, but it also creates new challenges - particularly in terms of ethics. In the conclusion, we insist on the importance of fairness, accountability and transparency in the process of making digital archives more accessible.

Download Full-text

Multi-Agent System Observer: Intelligent Support for Engaged E-Learning

Electronics ◽

10.3390/electronics10121370 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1370

Author(s):

Igor Vuković ◽

Kristijan Kuk ◽

Petar Čisar ◽

Miloš Banđur ◽

Đoko Banđur ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Intelligent Support ◽

Course Success ◽

States Of Emergency ◽

Learning Platform ◽

Rapid Spread ◽

E Learning ◽

Multi Agent ◽

Observer System

Moodle is a widely deployed distance learning platform that provides numerous opportunities to enhance the learning process. Moodle’s importance in maintaining the continuity of education in states of emergency and other circumstances has been particularly demonstrated in the context of the COVID-19 virus’ rapid spread. However, there is a problem with personalizing the learning and monitoring of students’ work. There is room for upgrading the system by applying data mining and different machine-learning methods. The multi-agent Observer system proposed in our paper supports students engaged in learning by monitoring their work and making suggestions based on the prediction of their final course success, using indicators of engagement and machine-learning algorithms. A novelty is that Observer collects data independently of the Moodle database, autonomously creates a training set, and learns from gathered data. Since the data are anonymized, researchers and lecturers can freely use them for purposes broader than that specified for Observer. The paper shows how the methodology, technologies, and techniques used in Observer provide an autonomous system of personalized assistance for students within Moodle platforms.

Download Full-text

Towards Distributed Association Rule Mining Privacy

Application of Agents and Intelligent Information Technologies - Advances in Intelligent Information Technologies ◽

10.4018/978-1-59904-265-7.ch011 ◽

2011 ◽

pp. 245-271

Author(s):

Mafruz Ashrafi ◽

David Taniar ◽

Kate Smith

Keyword(s):

Data Mining ◽

Data Privacy ◽

Large Data ◽

Digital Data ◽

Sensitive Information ◽

Distributed Data ◽

Data Repositories ◽

Actionable Knowledge ◽

The Cost ◽

Network Technologies

With the advancement of storage, retrieval, and network technologies today, the amount of information available to each organization is literally exploding. Although it is widely recognized that the value of data as an organizational asset often becomes a liability because of the cost to acquire and manage those data is far more than the value that is derived from it. Thus, the success of modern organizations not only relies on their capability to acquire and manage their data but their efficiency to derive useful actionable knowledge from it. To explore and analyze large data repositories and discover useful actionable knowledge from them, modern organizations have used a technique known as data mining, which analyzes voluminous digital data and discovers hidden but useful patterns from such massive digital data. However, discovery of hidden patterns has statistical meaning and may often disclose some sensitive information. As a result, privacy becomes one of the prime concerns in the data-mining research community. Since distributed data mining discovers rules by combining local models from various distributed sites, breaching data privacy happens more often than it does in centralized environments.

Download Full-text

Sentiment Analysis for Scraping of Product Reviews from Multiple Web Pages Using Machine Learning Algorithms

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-030-16660-1_66 ◽

2019 ◽

pp. 677-685

Author(s):

E. Suganya ◽

S. Vijayarani

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Web Pages ◽

Product Reviews

Download Full-text

Qatar Digital Library: A New Phase of Digital Archives

BIBLIOTHEK Forschung und Praxis ◽

10.1515/bfp-2019-2072 ◽

2019 ◽

Vol 43 (3) ◽

pp. 441-446 ◽

Cited By ~ 1

Author(s):

Maryam Ahmed Al-Mutawa

Keyword(s):

Digital Library ◽

British Library ◽

Digital Archives ◽

The Public ◽

Historical Materials ◽

Academic Researchers ◽

The World ◽

History Of ◽

Collaboration Project ◽

New Phase

Abstract Qatar Digital Library is a collaboration project by Qatar Foundation and the British Library to have an open access digital archive which aims to benefit people around the world. QDL offers cultural and historical materials of the Gulf and other regions and make it available online for everyone. The aim of QDL is to improve the understanding of the Islamic world, Arab cultural heritage, and the modern history of the Gulf for the public and the academic researchers.

Download Full-text

Session Fingerprinting in Android via Web-to-App Intercommunication

Security and Communication Networks ◽

10.1155/2018/7352030 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Efthimios Alepis ◽

Constantinos Patsakis

Keyword(s):

Mobile Devices ◽

Context Awareness ◽

Mobile Apps ◽

Personal Data ◽

Mobile App ◽

Web Pages ◽

Sensitive Information ◽

Privacy Concerns ◽

Everyday Lives

The extensive adoption of mobile devices in our everyday lives, apart from facilitating us through their various enhanced capabilities, has also raised serious privacy concerns. While mobile devices are equipped with numerous sensors which offer context-awareness to their installed apps, they can also be exploited to reveal sensitive information when correlated with other data or sources. Companies have introduced a plethora of privacy invasive methods to harvest users’ personal data for profiling and monetizing purposes. Nonetheless, up till now, these methods were constrained by the environment they operate, e.g., browser versus mobile app, and since only a handful of businesses have actual access to both of these environments, the conceivable risks could be calculated and the involved enterprises could be somehow monitored and regulated. This work introduces some novel user deanonymization approaches for device and user fingerprinting in Android. Having Android AOSP as our baseline, we prove that web pages, by using several inherent mechanisms, can cooperate with installed mobile apps to identify which sessions operate in specific devices and consequently further expose users’ privacy.

Download Full-text

Archives, linked data and the digital humanities: increasing access to digitised and born-digital archives via the semantic web

Archival Science ◽

10.1007/s10502-021-09381-0 ◽

2021 ◽

Author(s):

Ashleigh Hawkins

Keyword(s):

Digital Humanities ◽

Linked Data ◽

Large Scale ◽

Scale Up ◽

Digital Data ◽

Archival Data ◽

Digital Archives ◽

The Past ◽

Data Production ◽

Machine Readable

AbstractMass digitisation and the exponential growth of born-digital archives over the past two decades have resulted in an enormous volume of archives and archival data being available digitally. This has produced a valuable but under-utilised source of large-scale digital data ripe for interrogation by scholars and practitioners in the Digital Humanities. However, current digitisation approaches fall short of the requirements of digital humanists for structured, integrated, interoperable, and interrogable data. Linked Data provides a viable means of producing such data, creating machine-readable archival data suited to analysis using digital humanities research methods. While a growing body of archival scholarship and praxis has explored Linked Data, its potential to open up digitised and born-digital archives to the Digital Humanities is under-examined. This article approaches Archival Linked Data from the perspective of the Digital Humanities, extrapolating from both archival and digital humanities Linked Data scholarship to identify the benefits to digital humanists of the production and provision of access to Archival Linked Data. It will consider some of the current barriers preventing digital humanists from being able to experience the benefits of Archival Linked Data evidenced, and to fully utilise archives which have been made available digitally. The article argues for increased collaboration between the two disciplines, challenges individuals and institutions to engage with Linked Data, and suggests the incorporation of AI and low-barrier tools such as Wikidata into the Linked Data production workflow in order to scale up the production of Archival Linked Data as a means of increasing access to and utilisation of digitised and born-digital archives.

Download Full-text

Digital Archiving and Data Stewardship in French Archaeology

Internet Archaeology ◽

10.11141/ia.58.26 ◽

2021 ◽

Author(s):

Amala Marx ◽

◽

Kai Salas Rossenbach ◽

Emmanuelle Bryas ◽

◽

...

Keyword(s):

Data Management ◽

National Level ◽

Digital Transformation ◽

Digital Data ◽

Digital Archiving ◽

Data Stewardship ◽

Technical Issues ◽

Major Shift ◽

Critical Overview

In France, the archaeological sector has undergone a major shift in the last 10 years in terms of digital data creation and management. The digital transformation of the profession and its practices is still in progress and is not uniform. If general policies and laws are now clearly adopted at a national level, then institutional or individual situations are more complex. We can clearly separate the development-led and academic sectors, with reference to the volume of data produced and the challenges faced. A critical overview of the barriers highlights the fact that, beyond technical issues, data management (specifically sharing) is a human challenge in terms of scientific priority and in the adoption of new practices. This article gives an overview of the main questions and issues with reference to major nationwide initiatives.

Download Full-text

Privacy protected text analysis in DataSHIELD

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.289 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Rebecca Wilson ◽

Oliver Butters ◽

Demetris Avraam ◽

Andrew Turner ◽

Paul Burton

Keyword(s):

Text Analysis ◽

Individual Patient Data ◽

Patient Data ◽

British Library ◽

Free Text ◽

Sensitive Information ◽

Proof Of Concept ◽

Text Data ◽

Health Records ◽

Different Sources

ABSTRACT ObjectivesDataSHIELD (www.datashield.ac.uk) was born of the requirement in the biomedical and social sciences to co-analyse individual patient data (microdata) from different sources, without disclosing identity or sensitive information. Under DataSHIELD, raw data never leaves the data provider and no microdata or disclosive information can be seen by the researcher. The analysis is taken to the data - not the data to the analysis. Text data can be very disclosive in the biomedical domain (patient records, GP letters etc). Similar, but different, issues are present in other domains - text could be copyrighted, or have a large IP value, making sharing impractical. ApproachBy treating text in an analogous way to individual patient data we assessed if DataSHIELD could be adapted and implemented for text analysis, and circumvent the key obstacles that currently prevent it. ResultsUsing open digitised text data held by the British Library, a DataSHIELD proof-of-concept infrastructure and prototype DataSHIELD functions for free text analysis were developed. ConclusionsWhilst it is possible to analyse free text within a DataSHIELD infrastructure, the challenge is creating generalised and resilient anti-disclosure methods for free text analysis. There are a range of biomedical and health sciences applications for DataSHIELD methods of privacy protected analysis of free text including analysis of electronic health records and analysis of qualitative data e.g. from social media.

Download Full-text

Detection of Phishing Websites using an Efficient Feature-Based Machine Learning Framework

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.c5909.029320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 2857-2862

Keyword(s):

Machine Learning ◽

Personal Information ◽

Machine Learning Algorithms ◽

Sensitive Information ◽

Cyber Attack ◽

Learning Framework ◽

Internet Users ◽

User Data ◽

Feature Based ◽

Classification Prediction

Phishing is a cyber-attack which is socially engineered to trick naive online users into revealing sensitive information such as user data, login credentials, social security number, banking information etc. Attackers fool the Internet users by posing as a legitimate webpage to retrieve personal information. This can also be done by sending emails posing as reputable companies or businesses. Phishing exploits several vulnerabilities effectively and there is no one solution which protects users from all vulnerabilities. A classification/prediction model is designed based on heuristic features that are extracted from website domain, URL, web protocol, source code to eliminate the drawbacks of existing anti-phishing techniques. In the model we combine some existing solutions such as blacklisting and whitelisting, heuristics and visual-based similarity which provides higher level security. We use the model with different Machine Learning Algorithms, namely Logistic Regression, Decision Trees, K-Nearest Neighbours and Random Forests, and compare the results to find the most efficient machine learning framework.

Download Full-text

Data Mining applied on Web Robots Detection: A Systematic Mapping

10.21528/cbic2021-60 ◽

2021 ◽

Author(s):

Ramon Abilio ◽

Cristiano Garcia ◽

Victor Fernandes

Keyword(s):

Machine Learning ◽

Data Mining ◽

Learning Algorithms ◽

Web Server ◽

Machine Learning Algorithms ◽

Web Pages ◽

Systematic Mapping ◽

Mining Methods ◽

Server Logs ◽

Web Server Logs

Browsing on Internet is part of the world population’s daily routine. The number of web pages is increasing and so is the amount of published content (news, tutorials, images, videos) provided by them. Search engines use web robots to index web contents and to offer better results to their users. However, web robots have also been used for exploiting vulnerabilities in web pages. Thus, monitoring and detecting web robots’ accesses is important in order to keep the web server as safe as possible. Data Mining methods have been applied to web server logs (used as data source) in order to detect web robots. Then, the main objective of this work was to observe evidences of definition or use of web robots detection by analyzing web server-side logs using Data Mining methods. Thus, we conducted a systematic Literature mapping, analyzing papers published between 2013 and 2020. In the systematic mapping, we analyzed 34 studies and they allowed us to better understand the area of web robots detection, mapping what is being done, the data used to perform web robots detection, the tools, and algorithms used in the Literature. From those studies, we extracted 33 machine learning algorithms, 64 features, and 13 tools. This study is helpful for researchers to find machine learning algorithms, features, and tools to detect web robots by analyzing web server logs.

Download Full-text