Improvements for research data repositories: The case of text spam

2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure the possibility of reproducing the results and comparing them with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study has analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following demanded functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licencing issues and user rights. To demonstrate these functionalities, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at the URL https://rdata.4spam.group to facilitate understanding of this study.

2020 ◽  
Author(s):  
Anna M. Sozanska ◽  
Charles Fletcher ◽  
Dóra Bihary ◽  
Shamith A. Samarajiwa

Abstract: More than three decades ago, the microarray revolution brought high-throughput data generation capability to biology and medicine. Subsequently, the emergence of massively parallel sequencing technologies led to many big-data initiatives such as the human genome project and the encyclopedia of DNA elements (ENCODE) project. These, in combination with cheaper, faster massively parallel DNA sequencing capabilities, have democratised multi-omic (genomic, transcriptomic, translatomic and epigenomic) data generation, leading to a data deluge in bio-medicine. While some of these data-sets are trapped in inaccessible silos, the vast majority are stored in public data resources and controlled access data repositories, enabling their wider use (or misuse). Currently, most peer reviewed publications require the deposition of the data-set associated with a study under consideration in one of these public data repositories. However, clunky, difficult-to-use interfaces and subpar or incomplete annotation prevent discovering, searching and filtering of these multi-omic data and hinder their re-purposing in other use cases. In addition, the proliferation of a multitude of different data repositories, with partially redundant storage of similar data, is yet another obstacle to their continued usefulness. Similarly, interfaces where annotation is spread across multiple web pages, use of accession identifiers with ambiguous and multiple interpretations, and lack of good curation make these data-sets difficult to use. We have produced SpiderSeqR, an R package whose main features include the integration of the NCBI GEO and SRA databases, enabling an integrated unified search of SRA and GEO data-sets and associated annotations, conversion between database accessions, as well as convenient filtering of results and saving past queries for future use. All of the above features aim to promote data reuse to facilitate making new discoveries and maximising the potential of existing data-sets. Availability: https://github.com/ss-lab-cancerunit/SpiderSeqR


Author(s):  
Liah Shonhe

The main focus of the study was to explore the practices of open data sharing in the agricultural sector, including establishing the research outputs concerning open data in agriculture. The study adopted a desktop research methodology based on a literature review and bibliographic data from the WoS database. Bibliometric indicators discussed include yearly productivity, most prolific authors, and enhanced countries. Study findings revealed that research activity in the field of agriculture and open access is very low. There were 36 OA articles and only 6 publications had an open data badge. Most researchers do not yet embrace the need to openly publish their data sets despite the availability of numerous open data repositories. Unfortunately, most African countries are still lagging behind in the management of agricultural open data. The study therefore recommends that researchers publish their research data sets as OA. African countries need to put more effort into establishing open data repositories and implementing the necessary policies to facilitate OA.


2019 ◽  
Author(s):  
Reinder Broekstra ◽  
Els Maeckelberghe ◽  
Judith Aris-Meijer ◽  
Ronald Stolk ◽  
Sabine Otten

Abstract Background: Large-scale, centralized data repositories are playing a critical and unprecedented role in fostering innovative health research, leading to new opportunities as well as dilemmas for the medical sciences. Uncovering the reasons as to why citizens do or do not contribute to such repositories, for example, to population-based biobanks, is therefore crucial. We investigated and compared the views of existing participants and non-participants on contributing to large-scale, centralized health research data repositories with those of ex-participants regarding the decision to end their participation. This comparison could yield new insights into motives of participation and non-participation, in particular the behavioural change of withdrawal. Methods: We conducted 36 in-depth interviews with ex-participants, participants, and non-participants of a three-generation, population-based biobank in the Netherlands. The interviews focused on the respondents’ decision-making processes relating to their participation in a large-scale, centralized repository for health research data. Results: The decision of participants and non-participants to contribute to the biobank was motivated by a desire to help others. Whereas participants perceived only benefits relating to their participation and were unconcerned about potential risks, non-participants and ex-participants raised concerns about the threat of large-scale, centralized public data repositories and public institutes, such as social exclusion or commercialization. Our analysis of ex-participants’ perceptions suggests that intrapersonal characteristics, such as levels of trust in society and public goods, participation conceived as a social norm, and basic societal values account for differences between participants and non-participants.
Conclusions: Our findings indicate the fluidity of motives centring on helping others in decisions to participate in large-scale, centralized health research data repositories. Efforts to improve participation should focus on enhancing the trustworthiness of such data repositories and developing layered strategies for communication with participants and with the public. Accordingly, personalized approaches for recruiting participants and transmitting information along with appropriate regulatory frameworks are required, which have important implications for current data management and informed consent procedures.


2020 ◽  
Author(s):  
Reinder Broekstra ◽  
Els Maeckelberghe ◽  
Judith Aris-Meijer ◽  
Ronald Stolk ◽  
Sabine Otten

Abstract Background: Large-scale, centralized data repositories are playing a critical and unprecedented role in fostering innovative health research, leading to new opportunities as well as dilemmas for the medical sciences. Uncovering the reasons as to why citizens do or do not contribute to such repositories, for example, to population-based biobanks, is therefore crucial. We investigated and compared the views of existing participants and non-participants on contributing to large-scale, centralized health research data repositories with those of ex-participants regarding the decision to end their participation. This comparison could yield new insights into motives of participation and non-participation, in particular the behavioural change of withdrawal. Methods: We conducted 36 in-depth interviews with ex-participants, participants, and non-participants of a three-generation, population-based biobank in the Netherlands. The interviews focused on the respondents’ decision-making processes relating to their participation in a large-scale, centralized repository for health research data. Results: The decision of participants and non-participants to contribute to the biobank was motivated by a desire to help others. Whereas participants perceived only benefits relating to their participation and were unconcerned about potential risks, non-participants and ex-participants raised concerns about the threat of large-scale, centralized public data repositories and public institutes, such as social exclusion or commercialization. Our analysis of ex-participants’ perceptions suggests that intrapersonal characteristics, such as levels of trust in society and public goods, participation conceived as a social norm, and basic societal values account for differences between participants and non-participants.
Conclusions: Our findings indicate the fluidity of motives centring on helping others in decisions to participate in large-scale, centralized health research data repositories. Efforts to improve participation should focus on enhancing the trustworthiness of such data repositories and developing layered strategies for communication with participants and with the public. Accordingly, personalized approaches for recruiting participants and transmitting information along with appropriate regulatory frameworks are required, which have important implications for current data management and informed consent procedures.


2020 ◽  
Vol 23 (1) ◽  
Author(s):  
Eder Ávila-Barrientos

The objective of this work is to analyze the theoretical-methodological principles related to the description and accessibility of research data. A review of the state of the art on research data was carried out, addressing aspects of their citation, description and systematization; hermeneutics and discourse analysis were applied to specialized literature on research data, access to and description of research data, and data repositories. The metadata elements for describing research data sets that are included in the DataCite Metadata Schema were identified and analyzed in order to propose a descriptive profile for research data sets that can be applied in data repositories. If research data are properly described, their accessibility and reuse will be further promoted. To this end, academic and research institutions need to participate in developing open access policies for their research data.


KWALON ◽  
2016 ◽  
Vol 21 (1) ◽  
Author(s):  
René van Horik

Summary: Nowadays, research without a role for digital data and data analysis tools is barely possible. As a result, we see an increasing interest in research data management, as this enables the replication of research outcomes and the reuse of research data for new research activities. Data management planning outlines how to handle data, both during research and after the research is completed. Trusted data repositories are places where research data are archived and made available for the long term. This article covers the state of the art concerning data management and data repository demands with a focus on qualitative data sets.


2015 ◽  
Vol 24 (02) ◽  
pp. 1540008 ◽  
Author(s):  
Albert Weichselbraun ◽  
Daniel Streiff ◽  
Arno Scharl

Linking named entities to structured knowledge sources paves the way for state-of-the-art Web intelligence applications which assign sentiment to the correct entities, identify trends, and reveal relations between organizations, persons and products. For this purpose this paper introduces Recognyze, a named entity linking component that uses background knowledge obtained from linked data repositories, and outlines the process of transforming heterogeneous data silos within an organization into a linked enterprise data repository which draws upon popular linked open data vocabularies to foster interoperability with public data sets. The presented examples use comprehensive real-world data sets from Orell Füssli Business Information, Switzerland's largest business information provider. The linked data repository created from these data sets comprises more than nine million triples on companies, the companies' contact information, key people, products and brands. We identify the major challenges of tapping into such sources for named entity linking, and describe required data pre-processing techniques to use and integrate such data sets, with a special focus on disambiguation and ranking algorithms. Finally, we conduct a comprehensive evaluation based on business news from the New Journal of Zurich and AWP Financial News to illustrate how these techniques improve the performance of the Recognyze named entity linking component.


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6695 ◽  
Author(s):  
Andrea Garretto ◽  
Thomas Hatzopoulos ◽  
Catherine Putonti

Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments.


2016 ◽  
Author(s):  
Fritz Lekschas ◽  
Nils Gehlenborg

Abstract: The ever-increasing number of biomedical data sets provides tremendous opportunities for re-use, but current data repositories provide limited means of exploration apart from text-based search. Ontological metadata annotations provide context by semantically relating data sets. Visualizing this rich network of relationships can improve the explorability of large data repositories and help researchers find data sets of interest. We developed SATORI, an integrative search and visual exploration interface for biomedical data repositories. The design is informed by a requirements analysis through a series of semi-structured interviews. We evaluated the implementation of SATORI in a field study on a real-world data collection. SATORI enables researchers to seamlessly search, browse, and semantically query data repositories via two visualizations that are highly interconnected with a powerful search interface. SATORI is an open-source web application, which is freely available at http://satori.refinery-platform.org and integrated into the Refinery Platform.

