Data Science
Latest Publications


TOTAL DOCUMENTS

38
(FIVE YEARS 25)

H-INDEX

6
(FIVE YEARS 2)

Published by IOS Press

ISSN: 2451-8492, 2451-8484

Data Science ◽  
2022 ◽  
pp. 1-42
Author(s):  
Stian Soiland-Reyes ◽  
Peter Sefton ◽  
Mercè Crosas ◽  
Leyla Jael Castro ◽  
Frederik Coppens ◽  
...  

An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approach to packaging research artefacts along with their metadata in a machine-readable manner. RO-Crate is based on Schema.org annotations in JSON-LD, and aims to establish best practices for formally describing metadata in a way that is accessible and practical across a wide variety of situations. An RO-Crate is a structured archive of all the items that contributed to a research outcome, including their identifiers, provenance, relations and annotations. As a general-purpose packaging approach for data and their metadata, RO-Crate is used across multiple areas, including bioinformatics, digital humanities and regulatory sciences. By applying “just enough” Linked Data standards, RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility. An RO-Crate for this article (https://w3id.org/ro/doi/10.5281/zenodo.5146227) is archived at https://doi.org/10.5281/zenodo.5146227.
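To make the packaging idea concrete, below is a minimal sketch of what an RO-Crate metadata file (ro-crate-metadata.json) can look like, following the Schema.org-in-JSON-LD pattern the abstract describes; the dataset name and file entries are illustrative, not taken from the paper.

```python
import json

# A minimal RO-Crate metadata file: a metadata descriptor plus a root
# dataset ("./") listing the packaged artefacts. Names are illustrative.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata descriptor, pointing at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root dataset: the research outcome being packaged
            "@id": "./",
            "@type": "Dataset",
            "name": "Example experiment",
            "datePublished": "2022-01-01",
            "hasPart": [{"@id": "results.csv"}],
        },
        {   # one packaged artefact with its own metadata
            "@id": "results.csv",
            "@type": "File",
            "name": "Tabular results",
        },
    ],
}

with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)
```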


Data Science ◽  
2021 ◽  
pp. 1-17
Author(s):  
Muhammad Ghozy Al Haqqoni ◽  
Setia Pramana

The digital economy has been growing rapidly in recent years, especially in Southeast Asia, including Indonesia, and e-commerce is one of its components. BPS-Statistics Indonesia, a non-ministerial government agency reporting directly to the president, conducted an E-commerce Survey in 2019, which concluded that Indonesian traders' interest in selling online has increased in recent years. The case for using e-commerce data in official statistics is therefore growing. Several studies have applied e-commerce data to the calculation of the Consumer Price Index (CPI). In this research, e-commerce data from one of the online marketplaces in Indonesia is used in a case study to calculate the CPI at the city level in Java. The purpose of this study is to compare the marketplace-based CPI with BPS-Statistics' survey-based CPI. The data is collected through web scraping, preprocessed, and analyzed descriptively. The web scraper that was built proved usable for obtaining the data. Commodity-level CPIs based on marketplace data tend to reflect relatively high prices, resulting in a higher CPI than the BPS-Statistics CPI, while at the expenditure-group level the CPIs from the two approaches are broadly similar.
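The abstract does not spell out the index formula; as a hedged sketch, the snippet below computes a commodity-level index from scraped prices using the Jevons (geometric-mean) price relative, a common choice for elementary aggregates built from web data. All prices and commodity names are made up.

```python
import math

# (commodity, period) -> list of scraped prices; values are illustrative.
observations = {
    ("rice", "base"): [12000.0, 11800.0],
    ("rice", "current"): [12600.0, 12500.0],
}

def mean_price(commodity, period):
    prices = observations[(commodity, period)]
    return sum(prices) / len(prices)

def jevons_cpi(commodities):
    """Geometric mean of current/base price relatives, scaled to base=100."""
    relatives = [
        mean_price(c, "current") / mean_price(c, "base") for c in commodities
    ]
    return 100.0 * math.exp(sum(math.log(r) for r in relatives) / len(relatives))

print(jevons_cpi(["rice"]))  # ~105.5: prices rose about 5.5% over the base
```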


Data Science ◽  
2021 ◽  
pp. 1-15
Author(s):  
Jörg Schad ◽  
Rajiv Sambasivan ◽  
Christopher Woodward

Experimenting with different models, documenting results and findings, and repeating these tasks are day-to-day activities for machine learning engineers and data scientists. There is a need to keep control of the machine learning pipeline and its metadata, allowing users to iterate quickly through experiments and retrieve key findings and observations from historical activity. This is the need that Arangopipe serves. Arangopipe is an open-source tool that provides a data model capturing the essential components of any machine learning life cycle, together with an application programming interface that lets machine learning engineers record the details of the salient steps in building their models. This paper presents the components of the data model and an overview of the application programming interface, along with illustrative examples of basic and advanced machine learning workflows. Arangopipe is useful not only for users developing machine learning models but also for those deploying and maintaining them.
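The abstract does not show Arangopipe's actual API, so the sketch below only illustrates the general shape of an experiment-metadata store; every name in it (MetadataStore, log_run, the record fields) is hypothetical, not Arangopipe's real interface.

```python
import json
import time
import uuid

# Hypothetical sketch of an ML-metadata store of the kind the abstract
# describes. All names here are illustrative, not Arangopipe's real API.
class MetadataStore:
    def __init__(self, path="runs.jsonl"):
        self.path = path

    def log_run(self, dataset, params, metrics):
        """Record one experiment: which data, which settings, how it scored."""
        record = {
            "run_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "dataset": dataset,
            "params": params,
            "metrics": metrics,
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return record["run_id"]

store = MetadataStore()
store.log_run(
    dataset={"name": "churn.csv", "version": "2021-03"},
    params={"model": "random_forest", "n_estimators": 200},
    metrics={"auc": 0.91},
)
```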


Data Science ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 121-150
Author(s):  
Chang Sun ◽  
Lianne Ippel ◽  
Andre Dekker ◽  
Michel Dumontier ◽  
Johan van Soest

Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, a number of issues pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining (PPDDM) techniques aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems the techniques address, and identify the outstanding challenges in the field. The review identifies the consequences of the lack of standard criteria for evaluating new PPDDM methods and proposes comprehensive evaluation criteria with 10 key factors. We discuss the ambiguous definitions of privacy and the confusion between privacy and security in the field, and suggest how to write a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance the understanding of the challenges of applying theoretical PPDDM methods to real-life use cases, and of the importance of involving legal-ethical and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.
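The review itself proposes no single algorithm, but a minimal example gives a flavour of the techniques it surveys: secure summation via additive secret sharing, a classical PPDDM building block for horizontally partitioned data. The parties, values, and modulus below are illustrative.

```python
import random

# Secure summation with additive secret sharing: each party splits its
# private value into random shares, so only the total is ever revealed.
MODULUS = 2**61 - 1

def make_shares(value, n_parties):
    """Split `value` into n shares that sum to it modulo MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

private_values = [42, 17, 99]  # one per party, never shared directly
n = len(private_values)
all_shares = [make_shares(v, n) for v in private_values]

# Party j publishes only the sum of the j-th shares it received.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                for j in range(n)]
print(sum(partial_sums) % MODULUS)  # 158 == 42 + 17 + 99
```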


Data Science ◽  
2021 ◽  
pp. 1-20
Author(s):  
Laura Boeschoten ◽  
Roos Voorvaart ◽  
Ruben Van Den Goorbergh ◽  
Casper Kaandorp ◽  
Martine De Vos

The General Data Protection Regulation (GDPR) grants all natural persons the right to access their personal data if it is being processed by data controllers. The data controllers are obliged to share the data in an electronic format and often provide it in a so-called Data Download Package (DDP). These DDPs contain all data collected by public and private entities during the course of a citizen's digital life and form a treasure trove for social scientists. However, the data can be deeply private. To protect the privacy of research participants while using their DDPs for scientific research, we developed a de-identification algorithm that is able to handle typical characteristics of DDPs: regularly changing and differing file structures, visual and textual content, differing file formats, and private information such as usernames. We investigate the performance of the algorithm and illustrate how it can be tailored towards specific DDP structures.
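The abstract does not detail the algorithm itself; as a hedged sketch of the simplest ingredient of text de-identification, the snippet below redacts e-mail addresses and @usernames with regular expressions. The patterns and placeholders are illustrative only, not the authors' method.

```python
import re

# Rule-based redaction of two common identifier types in textual content.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"@\w{2,}"), "<USERNAME>"),
]

def deidentify(text):
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(deidentify("Contact jane.doe@example.com or ping @jane_d on the app."))
# -> "Contact <EMAIL> or ping <USERNAME> on the app."
```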


Data Science ◽  
2021 ◽  
pp. 1-21
Author(s):  
Kushal Veer Singh ◽  
Ajay Kumar Verma ◽  
Lovekesh Vig

Capturing data in the form of networks is becoming an increasingly popular approach for modeling, analyzing and visualizing complex phenomena, in order to understand the important properties of the underlying complex processes. Access to many large-scale network datasets is restricted due to privacy and security concerns. For several applications (such as functional connectivity networks), generating large-scale real data is also expensive. For these reasons, there is a growing need for advanced mathematical and statistical models (also called generative models) that can account for the structure of these large-scale networks without having to materialize them in the real world. The objective is to provide a comprehensible description of the network properties and to be able to infer previously unobserved properties. Researchers have developed various models that generate synthetic networks adhering to the structural properties of real networks. However, selecting the appropriate generative model for a given real-world network remains an important challenge. In this paper, we investigate this problem and provide a novel technique (named TripletFit) for model selection (or network classification) and estimation of structural similarities of complex networks. The goal of network model selection is to select a generative model that is able to generate a structurally similar synthetic network for a given real-world (target) network. We consider six prominent generative models as the candidate models. Existing model selection methods mostly suffer from sensitivity to network perturbations, dependency on the size of the networks, and low accuracy. To overcome these limitations, we considered a broad array of network features, with the aim of representing different structural aspects of the network, and employed deep learning techniques such as a deep triplet network architecture and a simple feed-forward network for model selection and estimation of structural similarities. Our proposed method outperforms existing methods with respect to accuracy, noise tolerance, and size independence on a number of gold-standard datasets used in previous studies.
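The abstract names a deep triplet network; as a sketch of the idea behind it (not the paper's exact architecture), the snippet below computes the standard triplet loss on structural-feature embeddings of networks. The feature vectors are toy values.

```python
import numpy as np

# Standard triplet loss: an anchor network should embed closer to a network
# from the same generative model (positive) than to one from another (negative).
def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(anchor - positive)  # distance to same-model network
    d_neg = np.linalg.norm(anchor - negative)  # distance to other-model network
    return max(0.0, d_pos - d_neg + margin)

# Toy structural-feature embeddings (e.g., degree stats, clustering, motifs).
anchor = np.array([0.9, 0.1, 0.4])
positive = np.array([0.8, 0.2, 0.5])
negative = np.array([0.1, 0.9, 0.9])

print(triplet_loss(anchor, positive, negative))  # 0.0: already well separated
```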


Data Science ◽  
2021 ◽  
pp. 1-11
Author(s):  
Tim Hulsen

One of the most popular methods to visualize the overlap and differences between data sets is the Venn diagram. Venn diagrams are especially useful when they are ‘area-proportional’, i.e. the sizes of the circles and the overlaps correspond to the sizes of the data sets. In 2007, the BioVenn web interface was launched, and it is being used by many researchers. However, this web implementation requires users to copy and paste (or upload) lists of IDs into the web browser, which is not always convenient and makes it difficult to create Venn diagrams ‘in batch’ or to automatically update a diagram when the source data changes. That kind of automation is only possible with software such as R or Python. This paper describes the BioVenn R and Python packages, which are very easy-to-use packages that can generate accurate area-proportional Venn diagrams of two or three circles directly from lists of (biological) IDs. The only required input is two or three lists of IDs. Optional parameters include the main title, the subtitle, the printing of absolute numbers or percentages within the diagram, colors and fonts. The function can show the diagram on the screen, or export it in one of the supported file formats. The function also returns all thirteen lists. The BioVenn R and Python packages were created for biological IDs, but they can be used for other IDs as well. Finally, BioVenn can map Affymetrix and EntrezGene IDs to Ensembl IDs. The BioVenn R package is available in the CRAN repository and can be installed by running ‘install.packages(“BioVenn”)’. The BioVenn Python package is available in the PyPI repository and can be installed by running ‘pip install BioVenn’. The BioVenn web interface remains available at https://www.biovenn.nl.
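A minimal usage sketch for the Python package follows, assuming the drawing function is named draw_venn as in the package's documentation; treat the exact signature and the structure of the returned lists as assumptions, since the abstract itself only names the install commands.

```python
from BioVenn import draw_venn  # assumed entry point; see the package docs

# Three ID lists in, an area-proportional Venn diagram out. The function
# name is an assumption based on the package documentation, not the abstract.
list_x = ["1007_s_at", "1053_at", "117_at"]
list_y = ["1053_at", "117_at", "121_at"]
list_z = ["117_at", "121_at", "1255_g_at"]

result = draw_venn(list_x, list_y, list_z)  # defaults draw to the screen
print(result)  # the thirteen returned ID lists
```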


Data Science ◽  
2021 ◽  
pp. 1-21
Author(s):  
Caspar J. Van Lissa ◽  
Andreas M. Brandmaier ◽  
Loek Brinkman ◽  
Anna-Lena Lamprecht ◽  
Aaron Peikert ◽  
...  

Adopting open science principles can be challenging, requiring conceptual education and training in the use of new tools. This paper introduces the Workflow for Open Reproducible Code in Science (WORCS): a step-by-step procedure that researchers can follow to make a research project open and reproducible. The workflow intends to lower the threshold for adopting open science principles. It is based on established best practices, and can be used either in parallel to, or in the absence of, top-down requirements by journals, institutions, and funding bodies. To facilitate widespread adoption, the WORCS principles have been implemented in the R package worcs, which offers an RStudio project template and utility functions for specific workflow steps. This paper introduces the conceptual workflow, discusses how it meets different standards for open science, and describes the functionality provided by the R implementation, worcs. The paper is primarily targeted towards scholars conducting research projects in R that involve academic prose, analysis code, and tabular data. However, the workflow is flexible enough to accommodate other scenarios, and offers a starting point for customized solutions. The source code for the R package and manuscript, and a list of examples of WORCS projects, are available at https://github.com/cjvanlissa/worcs.


Data Science ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 61-77
Author(s):  
Eduardo Barbaro ◽  
Eoin Martino Grua ◽  
Ivano Malavolta ◽  
Mirjana Stercevic ◽  
Esther Weusthof ◽  
...  

The mobile ecosystem is growing dramatically towards an unprecedented scale, with an extremely crowded market and fierce competition among app developers. Today, keeping users engaged with a mobile app is key to its success, since engaged users remain active consumers of services and/or producers of new content. However, users may abandon a mobile app at any time for various reasons, e.g., the success of competing apps or decreasing interest in the provided services. In this context, predicting when a user may disengage from an app is an invaluable resource for developers, creating the opportunity to apply intervention strategies aimed at recovering from disengagement (e.g., sending push notifications with new content). In this study, we aim to provide evidence that predicting when mobile app users disengage is possible with a good level of accuracy. Specifically, we propose, apply, and evaluate a framework to model and predict User Engagement (UE) in mobile applications via different numerical models. The proposed framework is composed of an optimized agglomerative hierarchical clustering model coupled with (i) a Cox proportional hazards model, (ii) a negative binomial model, (iii) a random forest, and (iv) a boosted-tree model. The framework is empirically validated by means of a year-long observational dataset collected from a real deployment of a waste recycling app. Our results show that in this context the optimized clustering model classifies users adequately and improves UE predictability for all numerical models. Also, the highest levels of prediction accuracy and robustness are obtained by applying either the random forest classifier or the boosted-tree algorithm.
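As a hedged sketch of the framework's overall shape (clustering followed by a classifier, not the paper's exact pipeline or data), the snippet below clusters synthetic user features with agglomerative hierarchical clustering and feeds the cluster label into a random forest that predicts disengagement.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for behavioural features and disengagement labels.
rng = np.random.default_rng(0)
features = rng.random((200, 3))  # e.g., sessions/week, recency, session depth
disengaged = (features[:, 1] < 0.3).astype(int)  # toy ground truth

# Step 1: hierarchical clustering groups similar users.
clusters = AgglomerativeClustering(n_clusters=4).fit_predict(features)

# Step 2: a classifier predicts disengagement from features + cluster label.
X = np.column_stack([features, clusters])
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, disengaged)
print(model.score(X, disengaged))  # training accuracy of this toy sketch
```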


Data Science ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 107-147
Author(s):  
Floris den Hengst ◽  
Eoin Martino Grua ◽  
Ali el Hassouni ◽  
Mark Hoogendoorn

The major application areas of reinforcement learning (RL) have traditionally been game playing and continuous control. In recent years, however, RL has been increasingly applied in systems that interact with humans, where it can personalize digital systems to make them more relevant to individual users. Challenges in personalization settings may differ from those found in traditional application areas of RL, yet an overview of work that uses RL for personalization is lacking. In this work, we introduce a framework of personalization settings and use it in a systematic literature review. Besides the setting, we review solutions and evaluation strategies. Results show that RL has been increasingly applied to personalization problems and that realistic evaluations have become more prevalent. RL has become sufficiently robust to apply in contexts that involve humans, and the field as a whole is growing. However, it does not seem to be maturing: the ratios of studies that include a comparison or a realistic evaluation show no upward trend, and the vast majority of algorithms are used only once. This review can be used to find related work across domains, provides insights into the state of the field, and identifies opportunities for future work.
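To give a flavour of RL-based personalization (this is a generic illustration, not an algorithm from the review), the sketch below runs a per-user epsilon-greedy bandit that learns which of two content variants a user responds to; the reward probabilities are simulated.

```python
import random

# A per-user epsilon-greedy bandit: explore occasionally, otherwise pick
# the variant with the best running value estimate for this user.
ACTIONS = ["variant_a", "variant_b"]

class UserBandit:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in ACTIONS}
        self.values = {a: 0.0 for a in ACTIONS}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)  # explore
        return max(ACTIONS, key=self.values.get)  # exploit

    def update(self, action, reward):
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n

bandit = UserBandit()
for _ in range(1000):
    action = bandit.choose()
    # Simulated user: clicks variant_a 70% of the time, variant_b 30%.
    reward = 1 if random.random() < (0.7 if action == "variant_a" else 0.3) else 0
    bandit.update(action, reward)
print(bandit.values)  # estimates should favour variant_a (~0.7 vs ~0.3)
```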

